Preface



This volume of the Springer Lecture Notes in Computer Science series contains the 
contributions presented at the International Symposium on Knowledge Exploration in 
Life Science Informatics (KELSI 2004) held in Milan, Italy, 25-26 November 2004. 
The two main objectives of the symposium were: 

• To explore the symbiosis between information and knowledge technologies and var- 
ious life science disciplines, such as biochemistry, biology, neuroscience, medical 
research, social sciences, and so on. 

• To investigate the synergy among different life science informatics areas, including 
cheminformatics, bioinformatics, neuroinformatics, medical informatics, systems bi- 
ology, socionics, and others. 

Modern life sciences investigate phenomena and systems at the level of molecules, 
cells, tissues, organisms, and populations. Typical areas of interest include natural evo- 
lution, development, disease, behavior, cognition, and consciousness. This quest is gen- 
erating an overwhelming and fast-growing amount of data, information, and knowledge, 
reflecting living systems at different levels of organization. Future progress of the life 
sciences will depend on effective and efficient management, sharing, and exploitation 
of these resources by computational means. 

Life science informatics is fast becoming a generic and overarching information 
technology (IT) discipline for the life sciences. It includes areas such as cheminformat- 
ics, bioinformatics, neuroinformatics, medical informatics, socionics, and others. While 
the precise scientific questions and goals differ within the various life science disci- 
plines, there is a considerable overlap in terms of the required key IT methodologies 
and infrastructures. Critical technologies include databases, information bases (i.e., 
containing aggregated, consolidated, derived data), executable models (i.e., knowledge- 
based and simulation systems), and emerging grid computing infrastructures and sys- 
tems (facilitating seamless sharing and interoperation of widely dispersed computa- 
tional resources and organizations). These base technologies are complemented by a 
range of enabling methodologies and systems such as knowledge management and dis- 
covery, data and text mining, machine learning, intelligent systems, artificial and com- 
putational intelligence, human-computer interaction, computational creativity, knowl- 
edge engineering, artificial life, systems science, and others. 

This symposium was a first step towards investigating the synergy of these knowl- 
edge and information technologies across a wide range of life science disciplines. 



Milan, Italy, November 2004 



Jesus A. Lopez 
Emilio Benfenati 
Werner Dubitzky 

A Pen-and-Paper Notation
for Teaching Biosciences 

Johannes J. Mandel 1,2 and Niall M. Palfreyman 1 

1 Dept. of Biotechnology and Bioinformatics,
Weihenstephan University of Applied Sciences, Freising, Germany
{niall.palfreyman, johannes.mandel}@fh-weihenstephan.de
2 School of Biomedical Sciences, University of Ulster, Coleraine, Northern Ireland 

Abstract. The authors introduce a graphical notation for representing 
general dynamical systems and demonstrate its use in three commonly 
occurring systems in the biosciences. They also indicate how the notation 
is used to facilitate the acquisition and transfer by students of skills in 
constructing equations from a verbal description of a system. 

1 Modelling in the Biosciences 

In her book “Making Sense of Life”, Evelyn Fox Keller [1] recounts a confronta- 
tion at the 1934 Cold Spring Harbour Symposium on Quantitative Biology be- 
tween Nicolas Rashevsky and Charles Davenport concerning Rashevsky’s [2] 
mathematical model of division in an idealised spherical cell. Davenport’s com- 
ment on the model was: 

“I think the biologist might find that whereas the explanation of the 
division of the spherical cell is very satisfactory, yet it doesn’t help as 
a general solution because a spherical cell isn’t the commonest form of 
cell.” 

which elicited the following retort from Rashevsky: 

“It would mean a misunderstanding of the spirit and methods of math- 
ematical sciences should we attempt to investigate more complex cases 
without a preliminary study of the simple ones.” 

What we observe in this altercation is a deep-set cultural division between bi- 
ologists and mathematical scientists, and one which must be experienced at some 
level by any student entering a degree programme in a discipline combining biol- 
ogy with the mathematical or technical sciences. There is a mildly schizophrenic 
atmosphere about such programmes arising from the diverse approaches of the 
two groups of scientists: The biologist must learn early in his career that living 
systems are inherently complex - too complex to hope to understand or explain 
them in all their gory detail. The engineer on the other hand develops during 
her training a confidence in her own ability to describe and possibly explain the 
world in terms of relatively simple equations. Whereas the biologist learns to 
accept a provisional lack of explanation, the engineer learns to need to explain. 

The result of this division is that the student of the mathematical or technical
biosciences is pulled in two conflicting directions: she is required on the one hand 
to develop a deep appreciation of the complexity of living systems, yet must 
simultaneously become adept in the technical skill of modelling this complexity 
mathematically at a level which admits tractable solution. The central skill that 
this student must learn is therefore to abstract from a given biological system 
the essential mathematical structure. 

Our experience is that bioscience students often have difficulties in learning 
this skill, and that these difficulties stem from a single question which is repeatedly
voiced by our students: “I know how to solve the equations, but I have no
idea how to derive these equations from a physical description of the problem!”

In this article we offer three components of a solution to this problem: 

1. We propose a pen-and-paper graphical notation (mutuality nets) for de- 
scribing the dynamical structure of a system. Mutuality nets emphasise the 
structural similarities between different systems, thus enabling the transfer 
of knowledge between systems. 

2. We define an unambiguous procedure for transcribing mutuality nets into 
mathematical equations. 

3. We illustrate the use of mutuality nets by using them to formulate three 
design patterns for situations commonly arising in the biosciences. “Each 
pattern describes a problem which occurs over and over again in our en- 
vironment, and then describes the core of the solution to that problem, in 
such a way that you can use this solution a million times over” (Christopher 
Alexander, quoted in [3]). 

In section 2 we use the Rain-barrel pattern to demonstrate how mutuality nets 
portray the generic dynamical structure in a variety of structurally similar sys- 
tems, and how this structure can be used to derive a mathematical model. In sec- 
tion 3 we formulate the Investment pattern, which describes catalytic processes, 
and in section 4 we use the Delayed balancing pattern to describe the dynamical 
structure of oscillating systems. Finally, in section 5 we discuss briefly how mu- 
tuality nets are woven into a currently running course in bioprocess engineering. 

2 Rain-Barrel: Using Feedback to Seek Equilibrium 

Mutuality nets arose out of teaching a first course in bioprocess engineering, 
where almost every equation can be derived in one of two ways - as a balance 
equation for the processes affecting some state variable (stock) s: 

$$\dot{s} = \text{(sum of input processes)} - \text{(sum of output processes)} \tag{1}$$

or as a rate equation for the stocks s_i coordinated by a process p:

$$p = \frac{\dot{s}_1}{a_1} = \frac{\dot{s}_2}{a_2} = \dots = \frac{\dot{s}_i}{a_i} = \frac{\dot{s}_{i+1}}{a_{i+1}} \tag{2}$$

A mutuality net links these two kinds of equation in a network of interacting
stocks and processes. It is a straightforward adaptation of stock and flow dia- 
grams [4] and Petri nets [5], [6], [7] for use in the biosciences; it has been discussed 
elsewhere [8], [9], and will be described in detail in a forthcoming paper. 

To see how mutuality nets are used in teaching, let us use them to represent 
the very simple system of a leaky rain barrel (fig. 1), into which water runs at 
a constant rate, but whose contents leak out at a rate which is proportional to 
the current volume of water in the barrel. This model displays a wide variety 
of behavioural intricacies which fascinate students - see [10] for an extensive 
pedagogical discussion of the rain-barrel. 



Fig. 1. The rain-barrel model. [Diagram: the process filling (f) feeds the stock Water volume (V), which is drained by the process leaking = kV.]



The first thing to notice here about the rain-barrel model is its wide appli- 
cability to biological systems. The following is just a short list of systems whose 
dynamical structure matches that of the rain-barrel: 

— Infusion and subsequent uptake of medication in the blood system. 

— mRNA / protein synthesis and degradation. 

— Substrate levels in a continuous-feed bioreactor. 

— Heating of a body and heat loss to environment. 

— Approach to terminal velocity in a falling body. 

— Growth of a feeding organism with energy loss through respiration. 

Once a student has understood the behaviour of the rain-barrel model, he 
has little trouble in transferring this knowledge to any of the above situations. 
In this way mutuality nets facilitate transfer by visually representing the essen- 
tial dynamical structure common to all of them. In addition this representation 
facilitates thinking, discussion and the exchange of views by lending itself to 
simple pen-and-paper constructions. 

To obtain the dynamical equation of the rain-barrel system, we transcribe 
the above diagram into mathematical notation. This is done by treating each box 
(e.g.: V in the above diagram) as a state variable, and each cloud (e.g. filling 
and leaking) as a process which either augments or depletes the value of the 
state variables to which it is connected by an arrow. The circle notation means 
in the case of the rain-barrel that V is also an information source for the leaking 
process, thus making V available to appear in the equation leaking = kV. In this 
way we find the balance equation $\dot{V} = f - kV$, which can easily be solved either
analytically or numerically by students to find the typical equilibrium-seeking 
behaviour of the rain-barrel shown in fig. 2. 
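
The transcription from net to equation can be tried out numerically. The following is a minimal sketch, not part of the original course material: it integrates the balance equation dV/dt = f - kV with the explicit Euler method and compares the result with the closed-form solution V(t) = f/k + (V0 - f/k) exp(-kt). The function names and the parameter values are assumptions chosen purely for illustration.

```python
import math

def simulate_rain_barrel(f, k, v0, dt, steps):
    """Integrate dV/dt = f - k*V with the explicit Euler method."""
    volumes = [v0]
    v = v0
    for _ in range(steps):
        filling = f          # constant inflow process
        leaking = k * v      # outflow conditioned by the current volume V
        v += (filling - leaking) * dt
        volumes.append(v)
    return volumes

def analytic_rain_barrel(f, k, v0, t):
    """Closed-form solution V(t) = f/k + (V0 - f/k) * exp(-k*t)."""
    return f / k + (v0 - f / k) * math.exp(-k * t)

# Hypothetical values, chosen only for illustration.
f, k, v0 = 2.0, 0.5, 0.0
dt, steps = 0.01, 2000                 # simulate 20 time units
volumes = simulate_rain_barrel(f, k, v0, dt, steps)
equilibrium = f / k                    # level at which filling equals leaking

for step in (0, 200, 1000, 2000):
    t = step * dt
    print(t, volumes[step], analytic_rain_barrel(f, k, v0, t))
```

Re-running the sketch with a larger or smaller k shows how the feedback constant changes both the equilibrium level f/k and the speed of convergence towards it.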

Experimenting with the rain-barrel pattern makes clear to students the im- 
portance of feedback for system behaviour, since it is precisely the feedback 
nature of the interaction between V and leaking in the rain-barrel which leads
to its distinctive behaviour. By experimenting with the feedback constant k, 
they discover for themselves how it affects the convergence rate of the basic 
equilibrating behaviour. 

3 Investment: Eliminating Idols 

In our next example we shall see how the syntax rules of mutuality nets can aid 
students in the derivation of the Michaelis-Menten (M-M) equation [11] for the 
enzymatic splitting of a single substrate S. 

Our first approximation to the M-M system is a simple adaptation of the 
rain-barrel model in which we assume that the enzyme E is an idol of the 
system in the sense that it is a state variable which affects the reaction, but 
without itself being affected by the reaction. Such a system might be denoted as 
in fig. 3. Here the constant value E conditions the process splitting according 
to the function kES, where k is the rate constant for the reaction.




Fig. 3. Syntactically incorrect model of the M-M system 



In fig. 3 we distinguish between two kinds of influence: cause and condition. 
A cause (thick, straight arrow) denotes an incremental flow of quantity between 
a stock and a process; a condition (thin, curved connector) makes available the 
value of its source (denoted by the small circle) to its target. We say that splitting 
causes changes in the substrate level S, and that E conditions this splitting. 

Of course, the only problem with this model is that it is physically incorrect!
It treats the product kE as the rate constant in an exponential process splitting, 
whereas in reality this is only a part of the complete enzymatic process - even 
the behavioural curve arising from this model is incorrect. 

At this point we offer students as a guide the mutuality rule which, al- 
though not part of the syntax of mutuality nets, nevertheless constitutes a strong 
recommendation, particularly when using mutuality nets to model biological sys- 
tems: A condition should only connect one process to another process. The effect 
of this rule is to discourage the formation of idols such as E in a dynamical 
model; if we wish E to condition the splitting process, then we should usually 
connect them with a cause, thereby at least admitting the possibility of a mutual 
interaction between splitting and E (hence the name “mutuality rule”). 

This is, of course, the case in reality, since E actually effects the splitting by 
physically investing itself in the splitting process. Yet it is also the case that the 
quantity of E in the system remains unaffected when the reaction is complete. In 
order to combine these two requirements, we are compelled to introduce a new 
state variable representing the transitory enzyme-substrate complex ES. This 
leads us to the physically correct model shown in fig. 4. 




Fig. 4. Syntactically correct model of the M-M system 



On the basis of this corrected model it is simple to first transcribe the com- 
plete dynamical equations for the M-M system: 

$$\frac{dS}{dt} = (\mathit{kBack})(ES) - (\mathit{kFwd})(E)(S) \tag{3}$$

$$\frac{dE}{dt} = (\mathit{kBack} + \mathit{kCat})(ES) - (\mathit{kFwd})(E)(S) \tag{4}$$

$$\frac{d(ES)}{dt} = (\mathit{kFwd})(E)(S) - (\mathit{kBack} + \mathit{kCat})(ES) \tag{5}$$

and then if required deduce the M-M equation by imposing the condition ES =
const and defining the M-M constant K_m = (kBack + kCat)/kFwd.

From our consideration of the M-M system we have made two discoveries.
First, the mutuality rule that no state variable can purely condition a process
helps us to formulate a physically realistic mathematical model of the system -
something students often need help with. Second, the cyclical structure of the
M-M model in fig. 4 is again a pattern commonly found in the biosciences, which 
we call “Investment”. This pattern represents any situation where something
is invested in the short term in order to return itself plus a payoff in the long 
term. Examples of the Investment pattern are: 

— Investment of energy by organisms in foraging activities in order to gain 
energy from food. 

— In the cell, phosphorylation of ADP represents an investment which is re- 
turned on hydrolysis of ATP, and which transports energy in the process. 
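
The Investment structure of the M-M system can also be explored numerically. The sketch below is illustrative only: it integrates the three dynamical equations for S, E and ES with the explicit Euler method and then compares the rate of the splitting step, kCat·ES, with the rate predicted by the M-M rate law once ES has become approximately constant. The rate constants and initial amounts are hypothetical, and the use of the total enzyme amount e0 in the rate law is the usual textbook assumption rather than something stated in the text above.

```python
def simulate_michaelis_menten(k_fwd, k_back, k_cat, s0, e0, dt, steps):
    """Euler integration of the S, E and ES balance equations; returns the trajectory."""
    s, e, es = s0, e0, 0.0
    history = [(s, e, es)]
    for _ in range(steps):
        binding = k_fwd * e * s      # E + S -> ES
        unbinding = k_back * es      # ES -> E + S
        splitting = k_cat * es       # ES -> E + product
        ds = unbinding - binding
        de = unbinding + splitting - binding
        des = binding - unbinding - splitting
        s, e, es = s + ds * dt, e + de * dt, es + des * dt
        history.append((s, e, es))
    return history

# Hypothetical rate constants and initial amounts, for illustration only.
k_fwd, k_back, k_cat = 10.0, 1.0, 1.0
s0, e0 = 10.0, 0.1
history = simulate_michaelis_menten(k_fwd, k_back, k_cat, s0, e0,
                                    dt=1e-4, steps=20000)

k_m = (k_back + k_cat) / k_fwd            # M-M constant
s, e, es = history[-1]
full_rate = k_cat * es                    # splitting rate in the full model
mm_rate = k_cat * e0 * s / (k_m + s)      # M-M rate law with total enzyme e0
print(full_rate, mm_rate)
```

Because the enzyme is only invested temporarily, the sum E + ES stays equal to e0 throughout the run, which is the numerical counterpart of the statement that the quantity of enzyme is unaffected once the reaction is complete.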



4 Delayed Balancing: Creating Oscillations 

In this section we shall introduce one final model which illustrates the relation- 
ship between oscillatory behaviour and feedback delays [12]. Imagine modifying 
the basic rain-barrel pattern by introducing a delay in the availability of infor- 
mation regarding the current level of water in the barrel. In this case the leak 
responds not to the current water level, but to some prior level, and the result is 
that the behaviour becomes no longer a direct convergence to equilibrium, but 
instead an oscillation about equilibrium as shown in fig. 5. 
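
The effect of a feedback delay can be reproduced with a few lines of code. The sketch below is an illustration under stated assumptions, not the model used to produce fig. 5: it integrates dV/dt = f - kV(t - tau) with the explicit Euler method, assuming that the barrel held the constant volume v0 before the simulation starts. The function name, the delay tau and all parameter values are hypothetical.

```python
def simulate_delayed_barrel(f, k, tau, v0, dt, steps):
    """Euler integration of dV/dt = f - k*V(t - tau), with V = v0 for t <= 0."""
    lag = int(round(tau / dt))           # delay expressed in time steps
    volumes = [v0]
    for n in range(steps):
        # The leak sees the volume as it was `lag` steps ago.
        delayed = volumes[n - lag] if n >= lag else v0
        volumes.append(volumes[n] + (f - k * delayed) * dt)
    return volumes

# Hypothetical values, for illustration only.
f, k, v0 = 1.0, 1.0, 0.0
direct = simulate_delayed_barrel(f, k, tau=0.0, v0=v0, dt=0.01, steps=3000)
delayed = simulate_delayed_barrel(f, k, tau=1.0, v0=v0, dt=0.01, steps=3000)

equilibrium = f / k
overshoot_direct = max(direct) - equilibrium     # never exceeds equilibrium
overshoot_delayed = max(delayed) - equilibrium   # overshoots, then oscillates
print(overshoot_direct, overshoot_delayed)
```

Without delay the volume approaches equilibrium from below and never exceeds it; with the delay the leak reacts too late, so the volume overshoots and then swings back.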




We can see how such a delay in the equilibrating feedback can lead to oscil- 
latory behaviour, but how do feedback delays arise in the first place? A typical 
way in which delays can occur is if a process depends not merely upon feedback 
from its source, but is also modulated by feedback from its effects, as in the 
Lotka-Volterra model [13] of fig. 6. 

The important point in the Lotka-Volterra model is that rabbits are increased 
by birthing and foxes are reduced by dying, but both of these effects are countered
by the process of interacting between the two populations. So where does
the delay come in? The process interacting is conditioned by two balancing effects, R and F,
which react only sluggishly to changes caused by the interacting process. The
oscillations of the Lotka-Volterra predator-prey model are well-known, and result 
directly from the delay thus introduced. 
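
A short simulation makes these oscillations visible. The sketch below is hedged in two respects. The rates birthing = 0.05 * R and dying = 0.1 * F are taken from fig. 6, but the coefficient c of the interacting process is an assumption, as is the choice that interacting removes rabbits and adds foxes at the same rate. The integration uses a classical fourth-order Runge-Kutta step, and all function names are hypothetical.

```python
def lotka_volterra_rates(r, f, c):
    """Net rates of change of rabbits R and foxes F."""
    birthing = 0.05 * r          # rate taken from fig. 6
    dying = 0.1 * f              # rate taken from fig. 6
    interacting = c * f * r      # coefficient c is an assumption
    return birthing - interacting, interacting - dying

def simulate_lotka_volterra(r0, f0, c, dt, steps):
    """Fourth-order Runge-Kutta integration; returns lists of R and F."""
    r, f = r0, f0
    rabbits, foxes = [r], [f]
    for _ in range(steps):
        k1r, k1f = lotka_volterra_rates(r, f, c)
        k2r, k2f = lotka_volterra_rates(r + 0.5 * dt * k1r, f + 0.5 * dt * k1f, c)
        k3r, k3f = lotka_volterra_rates(r + 0.5 * dt * k2r, f + 0.5 * dt * k2f, c)
        k4r, k4f = lotka_volterra_rates(r + dt * k3r, f + dt * k3f, c)
        r += dt * (k1r + 2 * k2r + 2 * k3r + k4r) / 6
        f += dt * (k1f + 2 * k2f + 2 * k3f + k4f) / 6
        rabbits.append(r)
        foxes.append(f)
    return rabbits, foxes

c = 0.001                                  # hypothetical interaction coefficient
rabbits, foxes = simulate_lotka_volterra(r0=150.0, f0=50.0, c=c, dt=0.1, steps=3000)

# Equilibrium levels at which both net rates vanish.
r_eq, f_eq = 0.1 / c, 0.05 / c
# Count how often the rabbit population passes through its equilibrium level.
crossings = sum(1 for a, b in zip(rabbits, rabbits[1:])
                if (a - r_eq) * (b - r_eq) < 0)
print(r_eq, f_eq, crossings)
```

Repeated crossings of the equilibrium level, rather than a one-sided approach to it, are the signature of the delayed balancing described above.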
Fig. 6. Delay in the Lotka-Volterra model. [Diagram: stocks Rabbits (R) and Foxes (F); processes birthing = 0.05 * R, dying = 0.1 * F, and interacting, which is conditioned by F and R.]



While somewhat vaguely formulated, the idea of oscillations caused by de- 
layed balancing effects is sufficiently common that we may consider it a design 
pattern which we call “delayed balancing”. We can observe this pattern in systems
such as:

— the simple pendulum, where deflection of the bob has the delayed balancing 
effect of reductions in the bob’s momentum; and 

— the Belousov-Zhabotinsky reaction [14], where reaction of I+ ions to ClO2+
ions causes delayed increases in the reaction back to I+.

5 Using Mutuality Nets in the Classroom 

To close, we shall briefly describe how we use mutuality nets within the context 
of an introductory degree course in bioprocess engineering. This course covers a 
spectrum of processes relevant to the dynamics of bioreactors, including trans- 
port, chemical reaction, heat flow, volume and concentration flow, and growth. 

In each lecture of the course students are consistently introduced to a new 
process in one concrete context, but always using the mutuality net notation, 
thus facilitating transfer of learning to other analogous contexts. As an example, 
in one lecture students are introduced to the general concept of flow using the 
concrete example of an LRC electrical circuit. This circuit is represented as a 
mutuality net and the occurrence of oscillations in the model is investigated. 
Then an identical structure is used to describe the motion of a mass on a spring, 
and then again for the flow of water in pipes. In all three cases students are 
encouraged to notice the isomorphic nature of the dynamics, and to solve prob- 
lems in one system by looking for analogous structure in other systems, as in 
the following exercise: 

If you suddenly turn off the tap in a very old house, you sometimes 
hear “hammering” in the pipes. Explain this phenomenon and describe 
a solution which will reduce the hammering. 

Students solve this problem by being aware of the commonality of structure 
between the electrical circuit and mass flow in pipes. Since they know that 
increasing the capacitance in the circuit will reduce the frequency of oscillations, 
they look for the analogous solution of introducing additional capacity into the 
pipe system. 
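
The reasoning behind this exercise can be checked with a short calculation. The sketch below assumes the standard expression for the undamped natural frequency of an inductor-capacitor loop, 1/(2π√(LC)), and ignores the resistance of the circuit. The component values are hypothetical, and the mapping of capacitance onto additional capacity in the pipe system is the analogy described in the text, not a hydraulic calculation.

```python
import math

def natural_frequency(inductance, capacitance):
    """Undamped natural frequency in Hz of an LC loop: 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance * capacitance))

inductance = 0.5                      # hypothetical inductance in henry
for capacitance in (1e-6, 4e-6, 16e-6):
    # Each fourfold increase in capacitance halves the frequency.
    print(capacitance, natural_frequency(inductance, capacitance))
```

The same inverse-square-root dependence is what students exploit when they propose adding capacity to the pipes.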

It is our experience that by teaching bioprocess engineering in this way, stu- 
dents become proficient in making use of the transfer of skills between different 
systems, and also in the construction of the equations which describe these sys- 
tems. 

Abstract. The recent availability of technologies for high-throughput
proteome analysis has led to the emergence of integrated mRNA and
protein expression data. In one such study by Ideker and co-workers,
changes in mRNA and protein abundance levels were quantified following
systematic perturbation of a specific metabolic pathway [1]. The authors
calculated an overall Pearson correlation coefficient of 0.61 between
changes in mRNA and protein expression. However, no change in protein
expression was observed for almost 80% of genes reported as having a
significant change in mRNA, indicating that a complex relationship exists
between mRNA and protein expression. To address this issue, the data
were sorted according to various criteria (protein and mRNA expression
ratios, confidence values, length of protein, fraction of cysteine residues
and half-life prediction) to identify any bias in experimental technique
which may affect the correlation. The mRNA expression ratio and the
confidence value had the strongest effect on how well the data correlated,
whilst protein detection was weakly dependent on the fraction of cysteine
residues in the protein. Initial investigations have indicated that
integrating the data with domain knowledge provides the best opportunity
for distinguishing between those transcriptome results which may be
interpreted in a straightforward manner and those which should be
treated with caution.



1 Introduction 

It is widely predicted that the application of global technologies to the analysis 
of biological molecules will mark a breakthrough in our understanding of biolog- 
ical processes. One of the first studies to take a systems approach was conducted 
at the Institute for Systems Biology [1]. Having defined the galactose pathway
in yeast as their system, the authors developed a model of its structure, i.e. the
interaction of the genes, proteins and other molecules involved. The second step
was to perturb the system and then monitor the changes in mRNA and proteins
expressed. By monitoring the response of these entities they were then able to 
refine their model of the underlying structure of the system. 

The authors reported a Pearson correlation coefficient of 0.61 between changes 
in protein and mRNA abundance, suggesting that mRNA may not be a reliable 
predictor of protein. This supported results from two earlier studies [2, 3] where 
there was a poor correlation between mRNA and protein for all but the most 
abundant proteins; it was concluded that these earlier results were due to the 
limitations in methods for quantitative analysis of the proteome. Whilst the 
two groups used different methods to quantify and identify proteins, two di- 
mensional gel electrophoresis (2-DE) was used to separate the proteins in both 
studies. 2-DE has well known biases against large, small or highly charged pro- 
teins, proteins difficult to solubilize and less abundant proteins [4]. The data set 
published by Ideker used Isotope Coded Affinity Tags (ICAT) [5] to quantify the 
proteome. The ICAT method is based on labeling cysteine with a heavy or a light 
affinity tag. The mixed population is then trypsinised and fragmented allowing 
relative quantification of proteins in each population by mass spectrometry. This 
technique offers an accurate and rapid method for identifying changes in protein 
expression between two samples. 

The correlation will also be dependent on the quality of the mRNA data. In 
early microarray studies, a gene was said to be differentially-expressed if the ratio 
of its expression level in the test condition to the control condition exceeded some 
threshold value. However, this approach made it difficult to identify changes in 
expression for genes expressed at low levels. Ideker and co-workers [6] developed 
an error model which relates actual intensities to observed intensities. System- 
atic errors caused by variation in labelling efficiency, spot size or uniformity, 
or errors introduced during the hybridization process are described using both 
multiplicative and additive errors. The error model is used to calculate a likelihood
statistic, λ, for each gene, which is used to determine whether intensities
are significantly different.

In this paper the correlation between changes in protein abundance (measured
using the ICAT technique) and changes in mRNA (based on their λ) was
investigated to identify whether the ICAT technique was biased against
certain categories or classes of proteins.

2 Materials and Methods 

The original study measured changes in expression level of approximately 6200 
nuclear yeast genes by comparing fluorescently labeled cDNA from a perturbed
strain with that from a reference strain. Following four replicate hybridizations,
118 genes were identified as having a significant change in mRNA,
i.e. the likelihood statistic λ exceeded the threshold value obtained in control
experiments in which mRNA was extracted from two identical strains grown under
identical conditions. The difference in protein abundance between the perturbed 
and the control condition was determined using ICAT. The resulting peptide 
mixture was fractionated and analysed using MS/MS to identify those proteins 
with significant changes in expression. 

Various statistical techniques have been used to correlate mRNA and protein
data. One group [3] proposes that the Spearman rank correlation should be used;
however, this method can produce inaccuracies in data sets for which there are a
large number of ties, as is the case for the data set analyzed here, particularly at
low mRNA and protein levels. Other groups [2] advocate the use of the Pearson
product-moment coefficient; however, this method is not robust against deviations from
a normal distribution. Neither of the data sets based on 2-DE [2,3] followed a 
normal distribution due to a bias towards the highly abundant proteins. The 
logged expression ratios in the ICAT data set show good agreement with a 
normal distribution (Table 1), indicating that low abundance proteins were being 
detected, thus demonstrating a clear advantage of the ICAT technique over 2-DE.
This also allows the use of the Pearson product-moment coefficient (r) in
the analysis; in each instance the critical r value at the 5% significance level
for the given sample size is provided for comparison.



Table 1. Statistical analysis of log abundance ratios for mRNA and protein data.
Two subsets of data were identified: those mRNA values which the original authors
considered significant, i.e. λ > 45 (mRNA(s)), and those mRNA values for which a
protein was observed (mRNA(p)). The χ² statistic was determined for each group to
evaluate the fit to a normal distribution. All mRNA data sets and the protein data
followed an approximately normal distribution (> 95% significant).

                      Number   Average   Variance   χ²
Protein                  289   -0.0002      0.044   8.9
All mRNA                5936      0.08      0.071   2.8
mRNA(s) (λ >= 45)        118     -0.26       0.17   2.4
mRNA(p)                  289     -0.12       0.18   7.9


The published data set [1] provided the log10 expression ratios for mRNA
and protein. A subset of the data comprising all mRNA ratios for which there
was a measured protein (i.e. 289 values) was ranked according to the following
criteria: (i) absolute change in protein expression, (ii) absolute change in mRNA
expression, (iii) confidence value, (iv) fraction of cysteine residues. The ranked
data were then sorted into equal-sized bins containing 30 consecutive data points
per bin. The Pearson correlation coefficient, average and variance were then
calculated for each bin.
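
The ranking and binning procedure can be sketched as follows. This is an illustration rather than the code used in the study. The data are synthetic, generated from a seeded random number generator. The average and variance are taken over the ranking criterion within each bin, which is an assumption. An incomplete final bin is dropped, which is also an assumption, since the treatment of the 19 values left over from 289 is not stated. The critical r value is derived from the tabulated two-sided 5% Student t value for 28 degrees of freedom (2.048).

```python
import math
import random
import statistics

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def binned_correlations(pairs, key, bin_size=30):
    """Rank (mRNA, protein) pairs by `key`, then summarise each full bin."""
    ranked = sorted(pairs, key=key)
    results = []
    for start in range(0, len(ranked) - bin_size + 1, bin_size):
        chunk = ranked[start:start + bin_size]
        keys = [key(pair) for pair in chunk]
        r = pearson_r([m for m, _ in chunk], [p for _, p in chunk])
        results.append((statistics.mean(keys), statistics.variance(keys), r))
    return results

def critical_r(t_crit, n):
    """Critical correlation coefficient for sample size n from a critical t value."""
    return t_crit / math.sqrt(t_crit ** 2 + n - 2)

# Synthetic log10 ratios, for illustration only (not the published data).
random.seed(0)
pairs = []
for _ in range(289):
    mrna = random.gauss(0.0, 0.42)
    protein = 0.5 * mrna + random.gauss(0.0, 0.2)
    pairs.append((mrna, protein))

# Criterion (ii): absolute change in mRNA expression.
bins = binned_correlations(pairs, key=lambda pair: abs(pair[0]))
r_crit = critical_r(t_crit=2.048, n=30)
for mean_key, var_key, r in bins:
    label = "significant" if abs(r) > r_crit else "n.s."
    print(f"{mean_key:.3f} {var_key:.4f} {r:+.2f} {label}")
```

Passing a different `key` function, for example one that returns the absolute protein ratio, reproduces the other ranking criteria with the same code.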

To determine whether protein length or cysteine content affected the ability
of the ICAT technique to detect a protein, a second data set comprising values
for which the mRNA had a confidence value greater than 25 was selected. This
set consisted of 365 mRNA ratios, 72 of which had corresponding protein ratios.
The data were ranked by length and by percentage cysteine content (i.e. the
number of cysteine residues divided by the length of the protein, multiplied by 100)
and again sorted into bins. The percentage of mRNA values with a measured protein
was then determined for each bin.
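
The two quantities used here are simple to compute. The following sketch shows the percentage cysteine content of a protein sequence in one-letter code and the percentage of entries with a measured protein per bin. The sequences, the detection flags and the bin size are toy values invented for illustration; they are not taken from the data set.

```python
def cysteine_percent(sequence):
    """Percentage of residues in a one-letter protein sequence that are cysteine (C)."""
    return 100.0 * sequence.upper().count("C") / len(sequence)

def detection_percent_per_bin(records, bin_size):
    """records: (cysteine percentage, protein detected?) tuples.
    Returns the percentage of entries with a measured protein in each bin."""
    ranked = sorted(records, key=lambda record: record[0])
    percentages = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        detected = sum(1 for _, was_detected in chunk if was_detected)
        percentages.append(100.0 * detected / len(chunk))
    return percentages

# Toy sequences and detection flags, for illustration only.
toy_proteins = [
    ("MKTAYIAK", False),            # no cysteine: cannot be tagged by ICAT
    ("M" + "A" * 8 + "C", True),    # 1 cysteine in 10 residues
    ("MCKLV", True),                # 1 cysteine in 5 residues
    ("MACDEFGC", False),            # 2 cysteines in 8 residues
]
records = [(cysteine_percent(seq), detected) for seq, detected in toy_proteins]
per_bin = detection_percent_per_bin(records, bin_size=2)
print(records, per_bin)
```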

The N-end rule determines the in vivo half-life of a protein based on the
identity of its N-terminal residues [7]. We have applied the rule, based on data for
the half-life of X-β-galactosidase in Saccharomyces cerevisiae at 30°C (Table 2),
to the data set outlined in section 2 above, to identify which proteins would
be expected to have a short half-life and which would have a long half-life.
The fraction of genes for which a protein was measured was determined for
each group and the correlation coefficient within groups was calculated.

Table 2. Half-life prediction based on N-terminal amino acid residue. Data for proline
are ambiguous and therefore not considered in this analysis.

Short half-life (< 30 minutes)      Cys Ala Ser Thr Gly Val Met
Long half-life (> 1200 minutes)     Arg Lys Phe Leu Trp Tyr His Ile Asp Glu Asn Gln



3 Results 



3.1 mRNA Data 

Two populations of mRNA data were analysed: mRNA for which a protein product
was identified (mRNA(p)) and mRNA values considered significant using the
maximum likelihood method (λ value > 45) (mRNA(s)). Figure 1 shows the
distributions for both data sets, from which it can be seen that the data sets
follow similar distributions but differ in their median. The Z test statistic was
used to determine whether there was a significant difference in the mean of the
two data sets (Table 3):



$$Z = \frac{(\bar{x}_s - \bar{x}_p) - (\mu_s - \mu_p)}{\sqrt{\sigma_s^2 / n_s + \sigma_p^2 / n_p}} \tag{1}$$

where $\bar{x}$ is the sample mean, μ is the actual mean, σ is the standard deviation,
n is the population size, and the subscripts s and p refer to significant mRNA values (s)
and mRNA values where a protein was identified (p).
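
The test can be reproduced from summary statistics alone. The sketch below assumes the usual two-sample form of the Z statistic, in which the difference of the sample means is divided by the square root of σ_s²/n_s + σ_p²/n_p, and uses the rounded means, standard deviations and sample sizes reported in Table 3. With these rounded inputs the magnitude of Z comes out close to the reported value; small differences are to be expected from rounding. The function name is hypothetical.

```python
import math

def two_sample_z(mean_s, std_s, n_s, mean_p, std_p, n_p, hypothesised_difference=0.0):
    """Z statistic for the difference between two sample means."""
    standard_error = math.sqrt(std_s ** 2 / n_s + std_p ** 2 / n_p)
    return ((mean_s - mean_p) - hypothesised_difference) / standard_error

# Rounded summary statistics as reported in Table 3.
z = two_sample_z(mean_s=-0.26, std_s=0.425, n_s=118,
                 mean_p=-0.12, std_p=0.422, n_p=289)
reject_null = abs(z) > 1.96     # two-sided test at the 5% level
print(round(z, 2), reject_null)
```

The sign of Z is negative here because the mean of mRNA(s) is lower than that of mRNA(p); only its magnitude matters for the two-sided decision.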



Table 3. The Z test statistic was used to determine whether the means of the two
populations (mRNA with measured protein (mRNA(p)) and the significant values of
mRNA (mRNA(s))) were significantly different. Null hypothesis: there is no overall
difference in the means of mRNA(s) and mRNA(p): μ_s - μ_p = 0. The number of
degrees of freedom is 405. The decision criterion is: if -1.96 < Z < 1.96, accept the
null hypothesis; if Z < -1.96 or Z > 1.96, reject the null hypothesis, i.e. there is a
significant difference between the means of the two populations. From the figures in
Table 3, Z = 3.1; as Z lies outside the acceptance region, it can be concluded that
there is a significant difference in the means of the two populations.

                     Number   Average   STD     95% confidence interval for mean
mRNA(s) (λ > 45)        118     -0.26   0.425   -0.34 to -0.19
mRNA(p)                 289     -0.12   0.422   [illegible]

Fig. 1. Frequency distributions of mRNA expression ratios (log10) for the whole data
set (- - -), mRNA for which a protein was measured (mRNA(p)), and mRNA
identified as significant (mRNA(s)). All data sets approximate to a normal
distribution.



The Z statistic lies outside the acceptance region, indicating a significant 
difference between the mean of the significant mRNA values (-0.26) and the 
mean of the mRNA values where a change in protein expression was observed 
(-0.12).

3.2 Effect of Expression Ratios on Correlation Coefficient

The correlation is dependent on the mRNA expression ratio (Figure 2a): a significant
correlation was observed when the average fold change was greater than
3. Protein expression ratios had a much smaller effect on the correlation, with
only bins representing the smallest expression ratios having an insignificant
correlation (Figure 2b). This is in contrast with results based on 2-DE for protein
separation [2,3], where the correlation is dependent on the protein abundance 
measurements. 

Fig. 2. Plot of Pearson correlation coefficients for binned mRNA (a) and protein (b)
expression ratios (log10). Each bin contained 30 consecutive values, following ranking
of the data by either mRNA (a) or protein (b). The bars are labeled with the average
absolute value for the expression ratio. The 5% significance level is given for
comparison; bars above the line are significant at the 5% level.

3.3 Confidence Values

The authors of the original paper [1] calculate a confidence value based on
maximum-likelihood analysis to identify differentially expressed microarray
data [6]. They suggest that genes having a λ value > 45 are differentially expressed,
45 being approximately the maximum obtained in control experiments
in which the two sets of mRNA values were derived from identical strains under
identical growth conditions. The data were sorted into bins according to confidence
value, and again the Pearson product-moment coefficient, average and variance
were determined for each bin (Figure 3). The only significant correlation (0.8) was
observed for the bin with an average confidence value of 40, confirming the
view of the original authors that this represented significant mRNA changes.
However, it was also noted that no significant change in protein expression was
observed for 76% of the genes for which the confidence value was above the
significance threshold indicated.

Fig. 3. Plot of Pearson correlation coefficients for binned confidence values. Each bin
contains 30 consecutive values, following ranking of the data by confidence value
determined by maximum likelihood [6]. The bars are labeled with the bin average for
the confidence value. The 5% significance level is given for comparison; bars above
the line are significant at the 5% level.

3.4 Cysteine Content 

Approximately 16% of the proteins contained no cysteine residues and therefore
would not be detected by the ICAT technique. The chance of identifying a
protein was weakly dependent (r = -0.4) on the average cysteine content, i.e.
the higher the cysteine content, the less likely it was that the protein was
identified (Figure 4). Conversely, the correlation between mRNA and protein
improved with the cysteine fraction (Figure 5). The majority of bins showed a
significant positive correlation between protein and mRNA ratios; however, the
correlation reaches a threshold value of approximately 0.75, corresponding to a
cysteine content of approximately 1.5%. Similarly, an improved correlation was
observed for longer proteins (Figure 6), although the ability to detect proteins
was independent of protein length.

Fig. 4. Effect of fraction of cysteine residues on Pearson correlation coefficient. All
mRNA data with a confidence value above 25 were selected and ranked according to
the percentage cysteine (no. of cysteine amino acids / length of protein * 100). Every
30 consecutive values were placed in a bin, and the number of observed proteins was
measured for each bin. The plot shows a weak but significant negative correlation
(r = -0.4, cf. 5% significance level 0.375) between the average number of proteins
observed in the bin (as a percentage) and the average cysteine content for the bin.
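
The cysteine percentage used for ranking follows directly from the formula in the figure caption (number of cysteine residues divided by protein length, times 100). A minimal sketch, assuming one-letter amino acid codes; the function name and the example sequence are hypothetical.

```python
def cysteine_percent(sequence):
    """Percentage of residues in a protein sequence that are cysteine ('C')."""
    residues = sequence.upper()
    if not residues:
        raise ValueError("empty protein sequence")
    return 100.0 * residues.count("C") / len(residues)


# Hypothetical 8-residue sequence with 3 cysteines.
example = "MKCAVLCC"
percent = cysteine_percent(example)
# A protein without cysteine cannot be labelled by ICAT, as noted in the text.
detectable_by_icat = percent > 0
print(f"{percent:.1f}% cysteine, detectable by ICAT: {detectable_by_icat}")
```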



3.5 N-Terminal Half-Life Prediction 

A similar percentage of proteins were predicted to have a short half-life in the 
complete data set (34%) as the proteins that were measured (33.5%). This would 
suggest that the ICAT method is not biased against either group. This is in con- 
trast to the data presented by Gygi [2] where only 17% of the proteins measured 
were predicted to have a short half-life. The Pearson correlation coefficient for 
the group of proteins with a short half-life was 0.64 compared to 0.60 for those 
predicted to have a long half-life. 

4 Discussion 

The emergence of integrated proteome-transcriptome studies in the literature
indicates that mRNA is a poor indicator of protein expression levels; clearly
the experimental methods used to measure protein and mRNA will affect the
correlation. 2-DE is biased against the accurate detection of very small or very
large proteins, proteins expressed at low levels, and basic or hydrophobic
proteins. Some of these problems should be overcome by using techniques such as
ICAT. It is clear from the data that a much larger fraction of low-abundance
proteins was identified using this technique, and a significant correlation with
mRNA was observed across a wider range of expression ratios. Early studies with
microarrays used a threshold test to identify differentially expressed genes,
i.e. a gene was said to be differentially expressed if the change in expression
between the control and the test conditions exceeded some threshold value. The
threshold approach is supported by this analysis, where an average 3-fold change
in mRNA expression was required before a significant correlation was observed.
However, the size of systematic errors compared to changes in expression will be
greater for low-abundance genes than for high-abundance genes; thus significant
changes in genes expressed at lower levels may be missed. The maximum-likelihood
method uses an error model and significance test to produce confidence values,
allowing the identification of differentially expressed genes across the whole
expression range. The effectiveness of this method is demonstrated by the fact
that the majority of genes in the confidence value bin with the highest Pearson
correlation had low mRNA and protein expression ratios. However, the average of
the significant mRNA ratios had a greater value than those mRNA ratios for which
a change in protein was identified, possibly indicating that larger changes in
mRNA are still preferentially selected. It should also be noted that no protein
was observed for the majority of genes with a significant change in mRNA (76%).

Fig. 5. Effect of fraction of cysteine residues on the Pearson correlation coefficient.
All mRNA data with a corresponding protein value were selected and ranked according
to the percentage cysteine (no. of cysteine amino acids / length of protein * 100).
Every 30 consecutive values were placed in a bin, and the Pearson correlation
coefficient was measured for each bin.

Fig. 6. Effect of protein length on the Pearson correlation coefficient. All mRNA data
with a corresponding protein value were selected and ranked by length. Every 30
consecutive values were placed in a bin, and the Pearson correlation coefficient was
measured for each bin.

In this study we have looked at three factors which may affect the abil- 
ity to detect a protein, namely the length, cysteine content and the predicted 
half-life. Of these, only the cysteine content had a weak correlation with the 
number of proteins detected. Unlike other data sets [2,3] neither the ability to 
detect proteins nor the correlation was affected by the predicted protein half-life, 
demonstrating a clear advantage of the ICAT technique over methods based on 
2-DE. 

Integrating the data with Gene Ontology (GO) [8] function classifications
allowed the identification of certain clusters of genes which had a good
correlation; for example, 27 genes annotated by GO as having a known role in
carbohydrate metabolism had a Pearson correlation coefficient of 0.86. This
strong correlation is not surprising, as the initial data set was produced
through perturbation of the galactose utilization pathway.

Approximately one third of the genes were negatively correlated with protein
expression. Analysis of this group through integration with GO allowed the
identification of a cluster of 9 genes associated with the stress response
(r = -0.82).

Many proteins involved in the stress response are regulated post-transcription
[9]. High mRNA levels and low protein levels were observed where there was
evidence for regulation of translation (SSA3 [9]), whilst high protein and low
mRNA levels were observed for genes (e.g. PUP2 [10]) in which the mRNA was
stored in the nucleus and released once protein levels had reached some
threshold limit.

Whilst the development of high throughput techniques such as microarrays 
and ICAT has enormous potential to increase our understanding of biological 
systems, this paper has demonstrated that the realization of this goal is depen- 
dent on confidence in the underlying data and the integration of results with the 
wealth of readily available domain knowledge.

Abstract. There is an overwhelming increase in submissions to genomic
databases, posing a problem for database maintenance, especially regarding
annotation of fields left blank during submission. In order not to include
all data as submitted, one possible alternative consists of performing the
annotation manually. A less resource-demanding alternative is automatic
annotation. The latter helps the curator, since predicting the properties
of each protein sequence manually is becoming a bottleneck, at least for
protein databases. Machine Learning (ML) techniques have been used to
generate automatic annotation and to help curators. A challenging problem
for automatic annotation is that traditional ML algorithms assume a
balanced training set. However, real-world data sets are predominantly
imbalanced (skewed), i.e., there is a large number of examples of one class
compared with just a few examples of the other class. This is the case for
protein databases, where a large number of proteins is not annotated for
every feature. In this work we discuss some over- and under-sampling
techniques that deal with class imbalance. A new method to deal with this
problem, which combines two known over- and under-sampling methods, is also
proposed. Experimental results show that the symbolic classifiers induced
by C4.5 on data sets after applying known over- and under-sampling methods,
as well as the newly proposed method, are always more accurate than the
ones induced from the original imbalanced data sets. Therefore, this is a
step towards producing more accurate rules for automating annotation.



1 Introduction 

Automatic annotation in genomics and proteomics is raising increasing interest
among researchers and database curators. Each day the volume of data which
has to be analyzed (mostly manually) increases to unmanageable levels. Thus,
there is a clear need for automated tools to generate or at least support such
an annotation process. The annotation process must be transparent in order to
explain/justify to the user the reason for each decision. As symbolic ML
algorithms induce rules that explain their predictions, they are appropriate
tools for this task. Following previous work on automated annotation using
symbolic ML techniques, the present work deals with a common problem in ML:
that data sets frequently have skewed class distributions. This is especially
the case in bioinformatics in general, and in automated protein annotation in
particular. This happens because a large number of proteins is not annotated
for every feature. In this work, we analyze some pre-processing techniques to
balance training data sets before applying a symbolic ML algorithm. The aim
of this procedure is to test and compare different techniques for dealing with
skewed class distributions, considering the accuracy improvement of the induced
rules. Our proposal is illustrated on databases related to proteins and families
of proteins and concerning Arabidopsis thaliana, a model organism for plants.

Improving Rule Induction Precision for Automated Annotation
J.A. Lopez et al. (Eds.): KELSI 2004, LNAI 3303, pp. 20-32, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004

This work is organized as follows. The next section describes related work 
concerning automated annotation using ML techniques. Section 3 explains the 
data collection procedure applied in order to generate the data sets used in this 
work and Section 4 discusses the problem of learning with imbalanced data sets. 
Methods commonly used to deal with this problem, as well as our approach 
are detailed in Section 5. Experiments and the results achieved are presented in 
Section 6. Finally, Section 7 concludes and outlines future research possibilities. 

2 Related Work 

There has been an explosion of data, information and computational tools
stemming from genome projects. In some databases, this implies that an
increasing amount of data must be analyzed manually before it is made available
to the community. Although several sources of data are used, our concern is with
data on proteins and families of proteins which can be found in the SWISS-PROT 1
database. SWISS-PROT is a protein sequence database that provides a high
level of annotation, such as the description of protein function, domain
structure, post-translational modifications, variants and others. Data on
proteins are important to people working in bioinformatics, as one of the
research goals is to understand how proteins interact in order to produce drugs,
for instance. Moreover, SWISS-PROT is a curated database. The current release of
SWISS-PROT (release 43.6) contains information on more than 150 thousand
entries (proteins).

Automatic annotation and ML are combined in [6] where the authors de- 
scribe a ML approach to generate rules based on already annotated keywords 
of the SWISS-PROT database. Such rules can then be applied to unannotated 
protein sequences. Since this work has actually motivated ours, we provide a 
brief introduction to it here. A detailed description can be found in [6]. 

In short, the authors have developed a method to automate the keyword
annotation process in SWISS-PROT, based on the supervised symbolic learning
algorithm C4.5 [9], using previously annotated keywords regarding proteins as
training data. Such data comprise mainly taxonomy entries, INTERPRO
classification, and PFAM and PROSITE patterns. Given these data in the
attribute-value format, C4.5 derives a classification for a target class, in
this case, a given keyword.

1 http://www.ebi.ac.uk/swissprot/

Since dealing with all data in SWISS-PROT at once would not be manageable
due to its size, data were divided into protein groups according to the
INTERPRO classification. Afterwards, each group was submitted to an
implementation of the learning algorithm C4.5 contained in the Weka 2 software
package. Rules were generated and a confidence factor for each rule was
calculated. Confidence factors were calculated based on the number of false and
true positives, by performing a cross-validation and by testing the error rate
in predicting keyword annotation over the TrEMBL 3 database. TrEMBL is a
database similar to SWISS-PROT; however, it allows data enriched with automated
classification and annotation. TrEMBL contains the translations of all coding
sequences present in the EMBL/GenBank/DDBJ Nucleotide Sequence Databases and
also protein sequences extracted from the literature or submitted to SWISS-PROT.

The approach by Kretschmann et al. was the basis for an automated annotation
tool to deal with data on mycoplasmas [2], as a way to reduce the data set
and also because annotating proteins related to mycoplasmas was the aim of that
project. Since the interest was in the annotation of keywords for proteins
related to the Mycoplasmataceae family, the generation of rules was based on a
reduced set of proteins extracted from SWISS-PROT. Thus, it was possible to
consider all attributes at once, unlike the approach proposed in [6]. Moreover,
a single rule for each keyword was generated, thus avoiding inconsistencies in
the proposed annotation. The rules were evaluated using a set of proteins from
the TrEMBL database. Results show that the quality of annotation was
satisfactory: between 60% and 75% of the given keywords were correctly predicted.

The work in [2] left open the need to improve the class distribution of skewed
training data sets through appropriate pre-processing methods. The objective is
to verify whether rules induced by symbolic ML algorithms using balanced data
sets are more accurate than those induced from natural (skewed) distributions.
We return to this issue by testing the hypothesis that balanced input data can
produce more accurate rules for automated annotation.

3 Data Collection 

In this section we briefly describe our approach to tackle the field "keywords"
in the SWISS-PROT database. The reader is directed to [2] for more details.
While the focus of that paper was on annotation of keywords related to sequences
regarding the family of Mycoplasmataceae, our current work focuses on
Arabidopsis thaliana because this is a model organism for plants. Moreover, the
proteins related to this organism have a better level of annotation, and there
are more cross-references among databases. This latter issue is very important
to us, since the cross-references form the basis of the attributes for ML
techniques.

2 http://www.cs.waikato.ac.nz/~ml/weka/

3 http://www.ebi.ac.uk/trembl/

The raw data were collected directly from the SWISS-PROT database mak- 
ing a query for Organism =Arabidopsis thaliana and selecting only data regard- 
ing keywords (a field of the SWISS-PROT database) which have at least 100 
occurrences. The attributes used to generate the rules are all related to the IN- 
TERPRO classification. A typical input file describes the class (keyword); then a 
number of lines follow indicating how the attributes are mapped for all proteins 
in the training set. 

4 Machine Learning and Imbalanced Data Sets 

Learning from imbalanced data is a difficult task since most learning systems 
are not prepared to cope with a large difference between the number of cases 
belonging to each class. However, real world problems with these characteristics 
are common. Researchers have reported difficulties in learning from imbalanced
data sets in several domains. Thus, learning with skewed class distributions is
an important issue in supervised learning. 

Why is learning under such conditions so difficult? Imagine the situation
illustrated in Figure 1, where there is a large imbalance between the majority
class (-) and the minority class (+). It also shows that there are some cases
belonging to the majority class that are incorrectly labelled (noise). Sparse
cases from the minority class may confuse a classifier like k-Nearest Neighbor
(k-NN). For instance, 1-NN may incorrectly classify many cases from the minority
class (+) because the nearest neighbors of these cases are noisy cases belonging
to the majority class. In a situation where the imbalance is very high, the
probability of the nearest neighbor of a minority class case (+) being a case of
the majority class (-) is near 1, and the minority class error rate will tend to
100%, which is unacceptable.

Decision trees (DTs) also experience a similar problem. In the presence of
noise, decision trees may become too specialized (overfitting), i.e., the
decision tree inducer may need to create many tests to distinguish the minority
class cases (+) from noisy majority class cases. Pruning the decision tree does
not necessarily alleviate the problem. This is due to the fact that pruning
removes some branches considered to be too specialized, labelling new leaf nodes
with the dominant class of these nodes. Thus, there is a high probability that
the majority class will also be the dominant class of these leaf nodes.

Fig. 1. Many negative cases against some sparse positive cases (a); balanced data set
with well-defined clusters (b).

It should be observed that the most widely used performance measure for 
learning systems is the overall error rate. However, the overall error rate is par- 
ticularly suspect as a performance measure when studying the effect of class 
distribution on learning since it is strongly biased to favor the majority class [7]. 
When classes are imbalanced, a more reliable performance measure is the area 
under the ROC curve (AUC). ROC 4 graphs [8] are widely used to analyze the 
relationship between false-negative rate and false-positive rate for a classifier,
and they are consistent for a given problem even if the distribution of positive 
and negative examples is highly skewed. In this work we use both ROC graphs 
and the area under the ROC curve (AUC). The AUC represents the expected 
performance as a single scalar and has a known statistical meaning: it is equiv- 
alent to the Wilcoxon test of ranks, and is equivalent to several other statistical 
measures for evaluating classification and ranking models [4]. Higher values of 
AUC indicate that a classifier will present a better average performance over all 
costs and class distributions. 
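
The rank-based reading of the AUC can be made concrete with a small sketch. It uses the pairwise (Mann-Whitney) form of the Wilcoxon rank statistic: the AUC is the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The scores and labels below are made up for illustration, and the function name is hypothetical.

```python
def auc_from_scores(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Made-up scores for 2 minority (1) and 6 majority (0) examples.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 0, 0, 0, 1, 0, 0, 0]
auc = auc_from_scores(scores, labels)

# A classifier that gives every example the same score is 75% accurate here
# (it always predicts the majority class) but has an AUC of only 0.5.
constant_auc = auc_from_scores([0.0] * len(labels), labels)
print(auc, constant_auc)
```

The second call illustrates why the overall error rate is misleading under imbalance while the AUC is not.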

5 Treating Imbalanced Data Sets 

One of the most direct ways for dealing with class imbalances is to alter the class 
distributions toward a more balanced distribution. There are two basic methods 
for balancing class distributions: 

Under-sampling, which aims to balance the data set by eliminating examples
of the majority class; and

Over-sampling, which replicates examples of the minority class in order to
achieve a more balanced distribution.

Both under-sampling and over-sampling have known drawbacks. Under-sampling
can throw away potentially useful data, and over-sampling can increase the
likelihood of overfitting, since most of the over-sampling methods make exact
copies of the minority class examples. Therefore, a symbolic classifier, for
instance, might construct rules that are apparently accurate, but actually
each rule only covers one replicated example.
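
The two basic balancing methods can be sketched in a few lines. This is a minimal illustration, assuming the majority class is the larger one and that both methods stop when the classes are of equal size; the function names and the toy data are hypothetical.

```python
import random


def random_under_sample(majority, minority, rng):
    """Keep a random subset of the majority class as large as the minority class."""
    return rng.sample(majority, len(minority)), list(minority)


def random_over_sample(majority, minority, rng):
    """Replicate random minority examples until both classes have the same size."""
    n_extra = len(majority) - len(minority)
    copies = [rng.choice(minority) for _ in range(n_extra)]
    return list(majority), list(minority) + copies


rng = random.Random(0)
majority = [("neg", i) for i in range(100)]
minority = [("pos", i) for i in range(15)]

under_major, under_minor = random_under_sample(majority, minority, rng)
over_major, over_minor = random_over_sample(majority, minority, rng)
print(len(under_major), len(under_minor))  # 85 majority examples discarded
print(len(over_major), len(over_minor))    # 85 exact minority copies added
```

The over-sampled minority class contains only exact copies of the 15 original examples, which is the source of the overfitting risk described above.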

Aiming to overcome the drawbacks previously described, in this work we 
propose a new method for balancing skewed data sets that combines known over 
and under-sampling techniques, namely Smote and Edited Nearest Neighbor 
Rule - ENN. 

4 ROC is an acronym for Receiver Operating Characteristic, a term used in signal
detection to characterize the tradeoff between hit rate and false alarm rate over a
noisy channel.

The Smote over-sampling technique was proposed in [3]. Its main idea is to
form new minority class examples by interpolating between several minority class
examples that lie together. This avoids the overfitting problem and causes
the decision boundaries for the minority class to spread further into the
majority class space. We propose a modification to the Smote technique, since its
original version does not handle data sets having only qualitative features, as
is the case for the data sets analyzed in this work. Our modification is as
follows: given two instances $E_i$ and $E_j$ to be interpolated into a new
instance $E_r$, and given that $x_{if}$ and $x_{jf}$ are respectively the values
of the $f$-th feature of $E_i$ and $E_j$, the corresponding feature value
$x_{rf}$ of $E_r$ is calculated as follows: if $x_{if}$ and $x_{jf}$ are equal,
then $x_{rf}$ assumes that value; otherwise we randomly assign one of the values
$x_{if}$ or $x_{jf}$ to $x_{rf}$. The process of creating new minority class
examples is illustrated in Figure 2.

Fig. 2. Balancing a data set: original data set (a); over-sampled data set with Smote
(b); identification of examples by ENN (c); and final data set (d).
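
The interpolation rule for qualitative features described above can be sketched as follows. The rule itself (copy equal values, otherwise pick one of the two at random) comes from the text; the choice of Hamming distance to find neighbouring minority examples, the parameter `k`, and all function names are assumptions made for this illustration.

```python
import random


def interpolate_nominal(e_i, e_j, rng):
    """Build a new instance: copy equal values, otherwise pick one at random."""
    return tuple(a if a == b else rng.choice((a, b)) for a, b in zip(e_i, e_j))


def hamming(a, b):
    """Number of features on which two instances differ (assumed distance)."""
    return sum(x != y for x, y in zip(a, b))


def smote_nominal(minority, n_new, k, rng):
    """Create n_new synthetic minority instances from nearby minority pairs."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [e for e in minority if e is not base]
        neighbours = sorted(others, key=lambda e: hamming(base, e))[:k]
        synthetic.append(interpolate_nominal(base, rng.choice(neighbours), rng))
    return synthetic


# Toy minority class with three qualitative features (illustrative values).
minority = [("a", "x", "p"), ("a", "y", "p"), ("b", "x", "p")]
synthetic = smote_nominal(minority, n_new=5, k=2, rng=random.Random(0))
print(synthetic)
```

To balance a data set, `n_new` would be set to the difference between the majority and minority class sizes.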



In (a) the original data set is shown, and in (b) the same data set with
new minority class examples created artificially is presented. The current
implementation creates as many minority class examples as needed to balance the
class distributions. This decision is motivated by the results presented in
[10], in which it is shown that allocating 50% of the training examples to the
minority class, while it does not always yield optimal results, generally leads
to results which are no worse than, and often superior to, those which use the
natural class distributions.

Although over-sampling minority class examples can balance class distributions,
some other problems usually present in data sets with skewed class distributions
are not solved. Frequently, class clusters are not well defined, since some
majority class examples might be invading the minority class space. The
opposite can also be true, since interpolating minority class examples can expand
the minority class clusters, creating artificial minority class examples too deep
in the majority class space. Inducing a classifier in such a situation can lead
to overfitting. For instance, a decision tree classifier may have to create
several branches in order to distinguish among the examples that lie on the wrong
side of the decision border.

In order to create better-defined class clusters we use the ENN technique,
proposed in [11]. ENN works as follows: each example that does not agree
with the majority of its k nearest neighbors is removed from the data set. In
this work we use ENN with k = 3. Since we apply ENN after artificially creating
minority class examples with Smote until the classes are balanced, both
majority and minority class examples are removed. In other words, we remove
each majority or minority class example from the data set that does not have at
least two of its 3 nearest neighbors from the same class.

In Figure 2 the identification of examples to be removed by ENN (c), and 
the final data set without these examples (d) are also shown. 
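
The ENN cleaning step can be sketched as below. The removal rule (drop any example that does not share its class with at least two of its 3 nearest neighbors) is taken from the text; the use of Hamming distance for qualitative features, the tie-breaking by position in the data set, and the function names are assumptions for illustration. All removal decisions are made against the original data set before anything is deleted.

```python
def hamming(a, b):
    """Number of features on which two instances differ (assumed distance)."""
    return sum(x != y for x, y in zip(a, b))


def edited_nearest_neighbour(examples, labels, k=3):
    """Remove every example that disagrees with the majority of its k neighbours."""
    keep = []
    for i, (x, y) in enumerate(zip(examples, labels)):
        others = [j for j in range(len(examples)) if j != i]
        # sorted() is stable, so distance ties are broken by data set order.
        nearest = sorted(others, key=lambda j: hamming(x, examples[j]))[:k]
        same_class = sum(1 for j in nearest if labels[j] == y)
        if 2 * same_class > k:  # for k = 3: at least two neighbours agree
            keep.append(i)
    return [examples[i] for i in keep], [labels[i] for i in keep]


# Two well-separated clusters plus one example lying in the wrong cluster.
examples = [
    (1, 1, 1, 1, 1, 1), (1, 1, 1, 1, 1, 0), (1, 1, 1, 1, 0, 1), (1, 1, 1, 0, 1, 1),
    (0, 0, 0, 0, 0, 0), (0, 0, 0, 0, 0, 1), (0, 0, 0, 0, 1, 0), (0, 0, 0, 1, 0, 0),
    (1, 1, 1, 1, 1, 1),
]
labels = ["+", "+", "+", "+", "-", "-", "-", "-", "-"]
clean_examples, clean_labels = edited_nearest_neighbour(examples, labels)
print(len(clean_examples), clean_labels)
```

In this toy data set only the last example, a "-" case surrounded by "+" cases, is removed.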

6 Results and Discussion 

In our experiments, we used the original implementation of the C4.5 symbolic
learning algorithm to induce decision trees [9]. In order to reduce the amount
of data to be analyzed, three keywords were selected as target classes:
Chloroplast, Nuclear protein and Transmembrane. At the end of the data
collection process, three attribute-value tables, one for each keyword, were
built. Table 1 summarizes the data used in this study. For each data set, it
shows the number of instances (#Instances), number of attributes (#Attributes),
number of quantitative and qualitative attributes, class attribute distribution
and the majority class error. This information was obtained using the MLC++ info
utility [5]. This utility takes an attribute-value data set as input and
returns a description of some of the main data characteristics, such as the ones
shown in Table 1. For example, the data set created for keyword Chloroplast
consists of 2371 examples with 1263 attributes, all of them qualitative. There
are two classes named Chloroplast and no_Chloroplast, where 14.34% of the
2371 instances belong to class Chloroplast and the remaining 85.66% to the
other class, no_Chloroplast. Finally, the majority error refers to the error of a
classifier that classifies every new example as belonging to the majority class.
In order to be useful, any ML algorithm should have an error rate lower than the
majority error.
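
The majority error is simple to compute from the class labels alone. A minimal sketch follows; the count of 340 Chloroplast instances is not stated in the text and is an assumption derived from 14.34% of 2371, and the function name is hypothetical.

```python
from collections import Counter


def majority_error(labels):
    """Error rate of a classifier that always predicts the most frequent class."""
    counts = Counter(labels)
    majority_count = max(counts.values())
    return 1.0 - majority_count / len(labels)


# Assumed counts: 340 of 2371 instances (about 14.34%) are Chloroplast.
labels = ["Chloroplast"] * 340 + ["no_Chloroplast"] * 2031
error = majority_error(labels)
print(f"majority error: {100 * error:.2f}%")
```

Any induced classifier with an overall error above this baseline is no better than always answering no_Chloroplast.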

An initial experiment with C4.5 with its default parameter values, trained
on the original skewed data sets, showed a low false-negative rate (FN). Table 2
summarizes the initial results obtained by C4.5, measured using the resampling
technique 10-fold cross-validation. For instance, for the Chloroplast keyword, it
is expected that only 0.69% of the examples labelled as no_Chloroplast will be
erroneously classified as Chloroplast. On the other hand, the false-positive rate
(FP) might be considered unacceptable since it is expected that 86.12% of the
Chloroplast examples will be erroneously classified as no_Chloroplast.

Table 1. Data set summary descriptions.

Keyword          #Instances  #Attributes        Class               Class %  Majority Error
                             (quanti., quali.)
Chloroplast      2371        1263 (0, 1263)     Chloroplast         14.34%   14.34%
                                                no_Chloroplast      85.66%   on value no_Chloroplast
Nuclear protein  2371        1263 (0, 1263)     Nuclear_protein     14.21%   14.21%
                                                no_Nuclear_protein  85.79%   on value no_Nuclear_protein
Transmembrane    2371        1263 (0, 1263)     Transmembrane       19.23%   19.23%
                                                no_Transmembrane    80.77%   on value no_Transmembrane

Table 2. False-positive and false-negative rates, with their respective standard errors,
for an initial experiment with the original class distribution.

Keyword          FP               FN
Chloroplast      86.12% (2.08%)   0.69% (0.19%)
Nuclear protein  32.28% (1.96%)   0.10% (0.07%)
Transmembrane    29.87% (1.93%)   0.42% (0.13%)


To obtain a baseline against which the proposed balancing method could be
compared, we applied two non-heuristic balancing methods: random over-sampling,
which randomly replicates the minority class examples; and random
under-sampling, which randomly removes majority class examples until a balanced
class distribution is reached. Table 3 presents the results obtained for the
original data sets, and for the data sets obtained after the application of the
random and Smote balancing methods, as well as the method proposed in this
work, which combines the Smote over-sampling technique with the ENN technique
and is named Smote over-sampling + ENN in Table 3. Furthermore, C4.5
was executed twice: once with its default parameter values, which activate the
decision tree pruning process (25%), and afterwards with the same parameters
but not allowing pruning. The AUC values and their respective standard errors
were measured in both cases (columns AUC (pruned) and AUC (unpruned) in
Table 3).

Using AUC as a reference metric that combines TP and FP, Table 3 shows
that the original skewed data sets have AUC values smaller than or similar to
the ones obtained after applying any of the four pre-processing methods, for
both pruned and unpruned C4.5 decision trees. This shows that balancing
data sets with skewed class distributions does improve the performance of the
induced classifiers.







Gustavo E.A.P.A. Batista, Maria C. Monard, and Ana L.C. Bazzan 



Table 3. AUC values for pruned and unpruned DTs for the original and pre-processed
data sets.

Keyword          Method                      AUC (pruned)   AUC (unpruned)
Chloroplast      Original                    51.11 (0.58)   59.79 (2.35)
                 Random Under-sampling       85.66 (0.80)   86.22 (2.27)
                 Random Over-sampling        79.55 (4.35)   76.24 (2.56)
                 Smote Over-sampling         90.66 (0.35)   94.88 (0.36)
                 Smote Over-sampling + ENN   86.73 (2.57)   84.94 (2.93)
Nuclear protein  Original                    46.48 (1.00)   54.96 (1.95)
                 Random Under-sampling       63.35 (1.95)   66.36 (2.04)
                 Random Over-sampling        66.83 (3.20)   66.84 (3.20)
                 Smote Over-sampling         66.20 (3.54)   63.13 (3.72)
                 Smote Over-sampling + ENN   64.38 (3.45)   92.96 (4.19)
Transmembrane    Original                    46.52 (1.16)   50.64 (1.56)
                 Random Under-sampling       51.11 (2.20)   50.81 (2.20)
                 Random Over-sampling        55.27 (3.17)   54.83 (3.22)
                 Smote Over-sampling         53.05 (3.30)   52.23 (6.63)
                 Smote Over-sampling + ENN   74.25 (6.01)   79.29 (5.51)



In what follows, the AUC values as well as the ROC curves obtained using 
the pre-processed data sets to induce C4.5 pruned and unpruned decision trees 
are discussed. 

For the Chloroplast data set, Smote over-sampling obtained the best AUC
values, i.e., 90.66 (pruned) and 94.88 (unpruned), which, in both cases, have the
lowest standard error - Table 3.

Considering the ROC curves for pruned decision trees - Figure 3 - it can be
observed that until around 10% FP the best result is provided by Random
over-sampling, followed by Smote over-sampling + ENN. Afterwards, Smote takes
the lead. Furthermore, from 20% TP upwards Smote obtains almost 100% TP. Note
that from the beginning (0% FP) until around 27% FP, our method provides
the second best result. Afterwards, the second best result is obtained by Random
under-sampling.

The ROC curves for unpruned DTs (Figure 4) show that before 5% FP
Random over-sampling is superior. From 5% FP Smote takes the lead, obtaining
nearly 100% TP around 12% FP. From around 15% FP our method and Random
under-sampling are the second best. In general, our method behaves similarly
to Random under-sampling for this keyword and unpruned DTs.

For the Nuclear protein keyword, the AUC values of Random and Smote
over-sampling are quite similar for pruned DTs (Table 3). For unpruned DTs,
however, our method improves the AUC substantially, reaching 92.96.

Figure 5 shows that the shapes of the ROC curves for pruned DTs are similar 
for all methods. The best one is Random over-sampling followed by Smote over- 
sampling and our method. 

For unpruned DTs, Figure 6 shows that from around 2% FP upwards our
method obtained much better results than all the other methods, reaching nearly
90% TP from 5% FP upwards.

For the Transmembrane keyword, our method obtained the best AUC values,
which are also far better than the ones obtained by all the other methods
(Table 3). However, the standard error increased considerably.

Figures 7 and 8 show the ROC curves for the pruned and unpruned DTs 
respectively, showing that from 5% FP upwards our method is much better 
than all the other methods. 

7 Conclusion and Future Work 

This paper presents methods to deal with the problem of learning with skewed
class distributions, applied to the automated annotation of keywords in the
SWISS-PROT database. It also proposes a new method to deal with this problem,
based on two known over- and under-sampling techniques. Although we are
interested in symbolic learning in order to induce classifiers that are able to
explain their predictions, the pre-processing methods presented here can be used
with other kinds of learning algorithms. The use of symbolic learning algorithms
for automatic annotation was proposed in [6] and [2], but neither work tackled
the problem of imbalance. Imbalance was initially treated in [1], but
using other methods and data sets with fewer features and instances
than the ones used in this work. The data used in this work essentially come
from databases of proteins and motifs, and are related to the organism
Arabidopsis thaliana.

Fig. 6. ROC curve for Nuclear protein keyword and unpruned DTs.

Experimental results using these data show that the symbolic classifiers
induced by C4.5 using the pre-processed (balanced) data sets outperformed the
ones induced using the original skewed data sets.

Fig. 8. ROC curve for Transmembrane keyword and unpruned DTs.



We also analyze the AUC values as well as the ROC curves obtained using the
data sets which have been balanced by the pre-processing methods treated in this
work, considering the pruned and unpruned decision trees induced by C4.5. For
each data set and for both sorts of trees, we show which pre-processing method is
more appropriate considering the possible distribution of FP on the ROC curves.
The new method proposed in this work obtained excellent results in three of the
six cases, and was well-ranked among the other methods in the remaining cases.
Regarding the syntactic complexity of the induced pruned and unpruned decision
trees, i.e., the number of decision rules and the mean number of conditions per
rule, it was observed that for all methods the syntactic complexity increases
with the value of AUC. In other words, the best performances are correlated
with more complex decision trees.

Future possibilities for this research include the use of other pre-processing
methods to balance data sets as well as new combinations of these methods and
the use of other symbolic learning systems.

Abstract. Multiple sequence alignment (MSA) is a vital problem in biology.
Optimal alignment of multiple sequences becomes impractical even for a modest
number of sequences [1], since the general version of the problem is NP-hard.
Because of the high time complexity of traditional MSA algorithms, even
today’s fast computers are not able to solve the problem for a large number of
sequences. In this paper we present a randomized algorithm to calculate distance
matrices, which is a major step in many multiple sequence alignment algorithms.
The basic idea employed is sampling (along the lines of [2]).



1 Introduction 

Sequence alignment is a problem of paramount importance and a fundamental
operation in computational biology research. It also forms the core of the
Human Genome Project, where sequences are compared to see if they have a common
origin in terms of structure and/or function. The goal is to produce the best alignment
for a pair of DNA or protein sequences (represented as strings of characters). A good
alignment has zero or more gaps inserted into the sequences to maximize the number
of positions in the aligned strings that match. For example, consider aligning the
sequences “ATTGGC” and “AGGAC”. By inserting gaps in the appropriate places,
the number of positions where the two sequences agree can be maximized [3]:

ATTGG-C
A--GGAC
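
To make the notion of "positions that agree" concrete, the short sketch below
counts the matching columns of a gapped alignment of the two example sequences.
The helper name count_matches is an illustrative assumption, as is the
convention that a gap never counts as a match.

```python
def count_matches(row_a, row_b, gap="-"):
    """Count columns where two equally long aligned rows agree on a residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)


# One gapped alignment of "ATTGGC" and "AGGAC": columns A, G, G and C agree.
matches = count_matches("ATTGG-C", "A--GGAC")
```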

Often it is necessary to evaluate more than two sequences simultaneously in
order to discover the functions, structure and evolution of different organisms.
The Human Genome Project uses this technique to map and organize DNA and protein
sequences into groups for later use. Some of the reasons for performing multiple
sequence alignment are as follows [19]: to infer phylogenetic relationships; to
understand evolutionary pressures acting on a gene; to formulate and test
hypotheses about protein 3-D structure (based on conserved regions); to formulate
and test hypotheses about protein function; to understand how protein function
has changed; and to identify primers and probes to search for homologous
sequences in other organisms.

There has been significant research in this area because of the need to align
many sequences of varying length. Algorithms dealing with this problem range
from simple comparison and dynamic programming procedures to complex ones that
rely on the underlying biological meaning of the sequences to align them more
accurately. Since multiple sequence alignment is an NP-hard problem, practical
solutions rely on clever heuristics. There is a constant trade-off between
accuracy and speed in these algorithms. Accurate algorithms need more processing
time and are usually capable of comparing only a small number of sequences,
whereas fast and less accurate ones can analyze many sequences in a reasonable
amount of time.

The dynamic programming algorithm introduced by Needleman and Wunsch [4] has
been frequently used in multiple sequence alignment, although it is typically
used for pairwise sequence alignment. Feng and Doolittle [5] developed an
algorithm for multiple sequence alignment using a modified version of [4]. There
are more complicated algorithms, such as CLUSTAL W [6], which rely on certain
scoring systems and local homology of the sequences.

Progressive algorithms suffer from a lack of computational speed because of
their iterative approach. Accuracy is also compromised because many algorithms
(including dynamic programming) reach a local minimum and cannot progress
further. Algorithms that rely significantly on biological information may also
be at a disadvantage in some domains. Often it is not necessary to find the most
accurate alignment among the sequences. In those cases, specialized algorithms
such as CLUSTAL W may be more than is needed. These algorithms also require some
human intervention while they are optimizing results. This intervention has to
be done by biologists who are very familiar with the data, and thus the usage of
such an algorithm is limited.

One of the more important uses of MSA is phylogenetic analysis [11].
Phylogenetic trees are the basis for understanding evolutionary relationships
among various species. In order to build a phylogenetic tree, orthologous
sequences have to be entered into the database, the sequences have to be aligned,
pairwise phylogenetic distances have to be calculated, and a hierarchical tree
has to be constructed using a clustering algorithm (see, e.g., [8]).

There are many algorithms that maximize accuracy and do not concern themselves
with speed. Few successful improvements have been made to reduce CPU time
since the proposal of the Feng and Doolittle [5] method [7]. Our approach
reduces CPU time by randomizing part of the multiple sequence alignment process.
It calculates the distance matrix for star alignment by randomly selecting
small portions of the sequences and aligning them. Since the randomly selected
portions are significantly smaller than the actual sequences, this results in a
significant reduction of the running time.



2 A Survey of the Literature 

In this section we survey some known results in this area. We also list some
competing algorithms and applications that are in use today.

2.1 CLUSTAL W 

The CLUSTAL W approach is an improvement of the progressive approach invented by
Feng and Doolittle [5]. CLUSTAL W improves the sensitivity of multiple sequence
alignment without sacrificing speed and efficiency [6]. It will be shown that
our algorithm is faster in theoretical running time than CLUSTAL W. CLUSTAL W
takes into account different types of weight matrices at each comparison step,
based on the homogeneity of the sequences being compared and their evolutionary
distances. The results of CLUSTAL W are highly accurate: it gives near-optimal
results for a data set with more than 35% identical pairs. For divergent
sequences, it is difficult to find a proper weighting scheme, and thus CLUSTAL W
does not produce a good alignment.

2.2 MSA Using Hierarchical Clustering 

Hierarchical clustering is a very interesting heuristic for MSA. It is a rather
old approach in the fast-changing field of bioinformatics. It uses a technique
often used in bioinformatics, but mostly in the field of data mining [9, 10].
Distance matrix calculation is the central theme of this approach. First, the
distance matrix is calculated for each possible pairwise alignment of sequences.
The distance matrix is an M x M matrix D such that D[i,j] is the distance (i.e.,
alignment score) between the two sequences Si and Sj (for i and j in the range
[1,M]). Here M stands for the number of input sequences (to be aligned). The
distance matrix can be computed using a fast pairwise alignment algorithm such
as [2]. The two sequences Si and Sj that have the lowest alignment score are
chosen from the matrix and are aligned with each other in one cluster. The
matrix of size M x M is then replaced with a matrix of size (M-1) x (M-1) by
deleting row j and column j from the original matrix. Also, row i is replaced
with the average score of i and j [8]. This process continues until all
sequences are aligned and form one cluster.

This algorithm takes O(M^2 N^2) time, where M is the number of sequences and N
is the length of the sequences when aligned [8].
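
The matrix reduction described above can be sketched as follows. This is a
minimal illustration under stated assumptions, not the implementation of [8]:
the function name merge_closest, the list-of-lists representation and the
label format are hypothetical, and only the distance bookkeeping is shown (the
actual alignment of the merged pair is omitted).

```python
def merge_closest(dist, labels):
    """Perform one clustering step on a symmetric distance matrix.

    Returns the reduced matrix and labels after merging the pair (i, j),
    i < j, that has the lowest off-diagonal score.
    """
    d = [row[:] for row in dist]  # work on a copy of the input
    m = len(d)
    # Find the off-diagonal pair with the lowest score.
    i, j = min(((a, b) for a in range(m) for b in range(a + 1, m)),
               key=lambda p: d[p[0]][p[1]])
    # Row and column i become the average of the old rows i and j.
    for k in range(m):
        if k not in (i, j):
            avg = (d[i][k] + d[j][k]) / 2.0
            d[i][k] = d[k][i] = avg
    # Delete row j and column j.
    reduced = [[v for c, v in enumerate(row) if c != j]
               for r, row in enumerate(d) if r != j]
    new_labels = ["(" + labels[i] + "," + labels[j] + ")" if k == i else lab
                  for k, lab in enumerate(labels) if k != j]
    return reduced, new_labels


# Hypothetical 3 x 3 example: S1 and S2 are closest and are merged first.
example = [[0, 2, 6],
           [2, 0, 4],
           [6, 4, 0]]
reduced, names = merge_closest(example, ["S1", "S2", "S3"])
```

Calling merge_closest repeatedly until one label remains reproduces the
"until all sequences form one cluster" loop.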

2.3 MAFFT: Fast Fourier Transform Based Approach 

The fast Fourier transform (FFT) is used to determine homologous regions
rapidly. For this purpose, amino acid sequences are converted into sequences
composed of volume and polarity values [7]. MAFFT implements two FFT-based
approaches: the progressive method and the iterative refinement method. In this
method, the correlation between two amino acid sequences is calculated using FFT
formulas. A high correlation value indicates that the sequences may have
homologous regions [7]. The program also has a sophisticated scoring system for
the similarity matrix and gap penalties. Just like CLUSTAL W, this approach uses
guide trees and similarity matrices.

Looking at the results presented in [7], we can determine that FFT-based
algorithms are significantly better than the CLUSTAL W and T-COFFEE algorithms.
It is important to notice that all these algorithms are still polynomial-time
algorithms and thus have similar behavior on a log-scaled graph. The only
difference is that FFT has a lower underlying constant. Thus, from an asymptotic
complexity point of view, FFT is not significantly better than the other
approaches.

2.4 Other Approaches to MSA 

There are many other innovative approaches to MSA. For instance, stochastic
processes are used to perform MSA. Simulated annealing and genetic algorithms
[11] are classic stochastic processes that have been used for MSA. The algorithm
of Berger and Munson [1] randomly aligns the sequences at first. Then it
iteratively tries to find better results and updates the sequences until no
further improvements can be achieved. Gotoh has described such an algorithm in
[12]. It is a doubly nested iterative strategy with randomization that optimizes
the weighted sum-of-pairs score with affine gap penalties [11].

There is also a relatively recent algorithm by Kececioglu, Lenhof, Mehlhorn,
Mutzel, Reinert and Vingron [14], which studies the alignment problem as an
integer linear program. This algorithm solves the MSA problem optimally when the
input consists of around 18 sequences.



3 Randomized Algorithm 

The idea of randomized sampling in the context of local alignment was proposed by
Rajasekaran et al. [2]. The basic idea is to show that instead of evaluating the
entire sequences of length N, we can achieve nearly the same result by evaluating
N^ε characters, where 0 < ε < 1.

3.1 Sequences of Uniform Length 

Consider the case when the sequences are of the same length. Our heuristic
reduces the time needed for pairwise alignments and, in effect, the overall time
of any algorithm that requires distance matrix calculations. Consider the problem
of computing the alignment score between the sequences S and T, each of length N.
Our algorithm selects a substring of length N^ε from sequence S starting at a
randomly selected location in the range [1, (N - N^ε)]. A substring of the same
length starting at the same location is chosen from the sequence T. These
substrings are aligned and the score is recorded. Since the length of these
substrings is N^ε each, the time complexity of one pairwise alignment is
O(N^(2ε)). This results in an overall run time of O(M^2 * N^(2ε)). This is a
significant reduction, provided the resulting distance matrix returns a reliable
and accurate score.
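
A small worked calculation shows the size of this reduction. The sketch below
assumes that the sample length N^ε is rounded down to an integer, which the text
does not specify; the function names and the value N = 1000 are illustrative
only.

```python
import math


def sample_length(n, eps):
    """Length of the sampled substring, floor(n ** eps), for 0 < eps <= 1."""
    return max(1, math.floor(n ** eps))


def dp_cells(n, eps):
    """Dynamic programming cells needed for one sampled pairwise alignment."""
    k = sample_length(n, eps)
    return k * k


full = dp_cells(1000, 1.0)     # whole sequences: 1000 * 1000 cells
sampled = dp_cells(1000, 0.5)  # eps = 0.5: 31 * 31 cells
```

For N = 1000 and ε = 0.5 each pairwise alignment fills 961 cells instead of
1,000,000, so the matrix computation shrinks by roughly three orders of
magnitude before constant overheads are taken into account.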

Algorithm 3.1 

Input: A file containing DNA or protein sequences separated by newline
characters; a value of ε.
Output: The distance matrix for each pair of sequences and the sum of distances
for each sequence.
Algorithm:

1. Read and store all sequences from the input file in an array.

2. For every input sequence Ti do

a. For every input sequence Pj do

i. Select a random number R that serves as the starting point.

ii. Select |Pj|^ε characters from Pj starting at position R.

iii. Similarly, select the same number of characters from Ti starting at
position R. Steps ii and iii result in two new sequences Pj' and Ti'.

iv. Use the Needleman-Wunsch algorithm to evaluate the pairwise alignment
score of Pj' and Ti'.

b. Record the score from step a-iv in matrix M at M(Ti, Pj).

3. At the end of step 2, we have a complete matrix M with distance scores for
each combination of sequences. Now sum the alignment scores of each row:
Sum_i = Σ_j M(Ti, Pj), where j ranges over all input sequences.

4. Select the lowest score from Sum_i and use the corresponding sequence as the
center of the star alignment.

5. Repeat the same process for different values of ε.

It is easy to see that the run time of the above algorithm is O(M^2 * N^(2ε)),
where M is the number of input sequences and N is the length of each input
sequence.
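
Steps 2 to 4 can be sketched in a few lines. This is a minimal sketch under
stated assumptions and not the authors' Java implementation: the names
nw_distance and sampled_matrix are hypothetical, the Needleman-Wunsch recurrence
is used in its cost-minimizing form with assumed unit mismatch and gap costs (so
that "lowest sum" selects the most central sequence), positions are 0-based, and
the sample length is rounded down.

```python
import random


def nw_distance(a, b, mismatch=1, gap=1):
    """Needleman-Wunsch global alignment cost with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cur.append(min(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def sampled_matrix(seqs, eps, rng=None):
    """Sampled distance matrix, row sums and star center (equal lengths)."""
    rng = rng or random.Random()
    n = len(seqs[0])
    k = max(1, int(n ** eps))             # sample length N^eps, rounded down
    m = len(seqs)
    matrix = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            r = rng.randint(0, n - k)     # shared random start for the pair
            matrix[i][j] = nw_distance(seqs[i][r:r + k], seqs[j][r:r + k])
    sums = [sum(row) for row in matrix]
    center = min(range(m), key=lambda i: sums[i])  # lowest row sum
    return matrix, sums, center
```

With ε = 1 the sketch degenerates to the full distance matrix, which is the
baseline that the result tables compare against.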

3.2 Sequences of Non-uniform Length 

The technique given in Section 3.1 works well on sequences of uniform length.
However, most sequences that need to be aligned are not of uniform length.
Chopping these sequences off at one end to equalize their lengths may result in
losing important and useful biological information. A better approach is needed
to deal with sequences of non-uniform length. The approach we took is to project
the smaller sequence onto the larger one to get a proportionate length of both
sequences. In other words, we take two sequences, determine the smaller of the
two and pick a random starting point for this sequence. Then we project this
random starting point onto the larger sequence, as shown in Fig. 1.




Fig. 1. Projection of the first sequence on the second when the sequences are of
different lengths

We take subsequences of appropriate length from the original sequences as
explained in step 2a-ii of Algorithm 3.1. This ensures that sequences of
different lengths do not prevent accurate random sampling. Once the random
samples are taken from the two sequences, the Needleman-Wunsch algorithm is
applied to them as explained in Section 3.1. Since the sequences can all be of
different lengths, the time it takes to evaluate Algorithm 3.1 depends on the
length of the longest sequence of the group being aligned. The running time of
this version of the algorithm is O(M^2 * |Pj_m|^(2ε)), where |Pj_m| is the
length of the sample taken from the largest sequence. This model is more general
than the one explained in Section 3.1.
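
One possible reading of the projection step is sketched below. The text does not
give a formula, so the linear scaling of the start position is an assumption, as
are the function name projected_samples, the 0-based indexing and the rounding
down of the sample lengths.

```python
import random


def projected_samples(a, b, eps, rng=None):
    """Return (sample of the shorter sequence, sample of the longer one)."""
    rng = rng or random.Random()
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    k_short = max(1, int(len(short) ** eps))   # |short|^eps characters
    k_long = max(1, int(len(long_) ** eps))    # |long|^eps characters
    start_short = rng.randint(0, len(short) - k_short)
    # Project the start onto the longer sequence (assumed linear scaling).
    start_long = start_short * len(long_) // len(short)
    start_long = min(start_long, len(long_) - k_long)  # keep sample in range
    return (short[start_short:start_short + k_short],
            long_[start_long:start_long + k_long])
```

The two samples can then be passed to any pairwise aligner; because global
alignment handles strings of different lengths, no further trimming is needed.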

3.3 An Analysis of the Sampling Process 

We can perform an analysis here as well, similar to the one done in Section 3.2.
The idea is as follows. Let S and T be any two input sequences with |S| = s and
|T| = t, with s ≤ t. If x is a substring of S, then we can correlate it with a
corresponding substring of T (as shown in Fig. 1). Let the distance between S
and T be K. If one assumes that the inserts and deletes have happened at random
positions, then we can assume that the scores between S and T are uniformly
distributed. As a result, if K is the distance between S and T and x is a
substring of S (and y the corresponding substring of T), then the expected
distance between x and y is K|x|/|S|. Using this and Chernoff bounds, we can get
a high-probability confidence interval for the estimator.
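
The argument can be written out as follows. This is only a sketch: it assumes
that the distance between x and y can be modeled as a sum of independent 0/1
indicators (one per edit falling inside the sampled window), which the text does
not state explicitly, and it uses the standard multiplicative Chernoff bound.

```latex
% Assumption: d(x,y) is a sum of independent 0/1 random variables.
\[
  \mathbb{E}\bigl[d(x,y)\bigr] \;=\; \mu \;=\; K\,\frac{|x|}{|S|},
  \qquad\text{so}\qquad
  \hat{K} \;=\; d(x,y)\,\frac{|S|}{|x|}
  \quad\text{is an unbiased estimator of } K .
\]
% Multiplicative Chernoff bound, valid for 0 < \delta < 1:
\[
  \Pr\bigl[\,\lvert d(x,y) - \mu \rvert \ge \delta\mu \,\bigr]
  \;\le\; 2\exp\!\left(-\frac{\delta^{2}\mu}{3}\right).
\]
```

With |x| = |S|^ε the mean is μ = K|S|^(ε-1), so the bound tightens as either the
true distance K or the sampling exponent ε grows.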

3.4 Segmented Algorithm 

This is a slightly different approach from what we have discussed earlier. In
our earlier approaches we took the entire random sample from one location of a
large sequence. Doing so may produce samples that do not reflect the
characteristics of the entire sequence. To reduce the effect of this selection
bias, we take random samples from different locations of the sequence. In this
approach, we determine the appropriate sample length for each sequence based on
the value of ε. This sample length is divided into three equal parts (the third
being larger or smaller than the other two to compensate for lengths that are
not divisible by 3). Each random sample is thus a collection of three smaller
random samples. These smaller samples are taken by identifying three random
starting locations and taking N^ε/3 characters from each starting location
(where N is the length of the given sequence). The resulting sequence is still
of length N^ε, and the Needleman-Wunsch algorithm is applied to the sequences in
the same way as in Section 3.2.
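
The segmented sampling step might look as follows. This is a hedged sketch, not
the authors' code: the function names are hypothetical, the last segment is
assumed to absorb the remainder, segments are allowed to overlap, and the text
does not say whether the two sequences of a pair share the same start positions
(the helpers below make either choice possible).

```python
import random


def segment_lengths(total, parts=3):
    """Split total into `parts` lengths; the last one absorbs the remainder."""
    base = total // parts
    return [base] * (parts - 1) + [total - base * (parts - 1)]


def draw_starts(n, lengths, rng):
    """Draw one random 0-based start per segment for a sequence of length n."""
    return [rng.randint(0, n - length) for length in lengths]


def take_segments(seq, starts, lengths):
    """Concatenate the segments of seq that begin at the given starts."""
    return "".join(seq[s:s + k] for s, k in zip(starts, lengths))


def segmented_sample(seq, eps, rng=None, parts=3):
    """Sample about len(seq)^eps characters from `parts` random locations."""
    rng = rng or random.Random()
    total = max(parts, int(len(seq) ** eps))
    lengths = segment_lengths(total, parts)
    return take_segments(seq, draw_starts(len(seq), lengths, rng), lengths)
```

To sample a pair of equal-length sequences at the same locations, call
draw_starts once and pass the result to take_segments for both sequences.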

These algorithms use a design from NeoBio [15]: the basic class framework is
taken from the NeoBio package [15]. Our algorithms were implemented in Java.



4 Results 

4.1 Datasets 

We have tested the implementation of this algorithm on many different datasets. The 
purpose of this section is to introduce readers to the types of datasets used for finding 
empirical results for our algorithms. There are four types of data used. 

(1) Protein sequences gathered from BLAST. 

This Dataset was created by taking one Protein sequence and using BLAST to 
find similar sequences. These sequences are taken as input to the algorithms 
mentioned in Section 3. 

(2) Creating X-number of “similar” sequences from one DNA sequence. 

This dataset was created by taking one DNA sequence and manipulating the
characters of the sequence with a certain probability to create a “similar”
sequence. In other words, each character in the original sequence remains the
same in the new sequence with probability p but changes to a different character
with probability (1-p). This dataset can be used as a representation of genetic
modification in one species over a period of time. It can also be used to
represent similarities between species that have evolved from a common ancestor.

(3) Creating a set of completely random DNA sequences. 

This dataset is created by selecting characters randomly to form DNA sequences. 
This dataset is to verify the bias of the algorithms tested. 

(4) Creating groups of “similar” sequences from a set of random DNA sequences.

This dataset is created by combining features of (2) and (3). First, x random
sequences are created, and these random sequences are used as base sequences to
create groups of “similar” sequences as explained in (2). There is no similarity
between the groups because they are created from random base sequences. This
dataset can be used to represent a comparison of evolution paths between totally
unrelated species.
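
The synthetic datasets (2) to (4) can be generated along the following lines.
This is a minimal sketch with hypothetical function names; it assumes a uniform
choice among the four nucleotides and, on a change, a uniform choice among the
three other nucleotides, neither of which is stated in the text.

```python
import random

DNA = "ACGT"


def random_dna(length, rng):
    """Dataset (3): a sequence of independently chosen nucleotides."""
    return "".join(rng.choice(DNA) for _ in range(length))


def mutate(seq, p, rng):
    """Dataset (2): keep each character with probability p, otherwise
    replace it with a different nucleotide."""
    out = []
    for ch in seq:
        if rng.random() < p:
            out.append(ch)
        else:
            out.append(rng.choice([c for c in DNA if c != ch]))
    return "".join(out)


def grouped_dataset(n_groups, group_size, length, p, rng):
    """Dataset (4): each group is a random base plus mutated copies of it."""
    groups = []
    for _ in range(n_groups):
        base = random_dna(length, rng)
        copies = [mutate(base, p, rng) for _ in range(group_size - 1)]
        groups.append([base] + copies)
    return groups
```

For example, grouped_dataset(7, 5, 300, 0.94, random.Random(0)) yields 7 groups
of 5 sequences each; the length 300 and the seed are arbitrary choices.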

4.2 Results 

Two algorithms have been tested on all input sets. The initial algorithm
explained in Section 3.1 is not tested as it is, but algorithm 3.2 acts as a
superset of algorithm 3.1, and thus algorithm 3.1 is tested in its entirety
through 3.2. The results show the value of ε, the winning sequence number (more
than one if they are close), the running time (in milliseconds) and, in some
cases, the distance matrix scores. The first dataset, Input1, was created by
taking the sequence of a conserved hypothetical protein from Methanocaldococcus
jannaschii DSM 2661 [16] and using BLAST [17] to find similar sequences. The top
9 similar sequences from the BLAST result are used to create the input file.
Testing algorithm 3.2 (Table 4.2.1) and algorithm 3.3 (Table 4.2.2) on this
input file gave the following results.



Table 4.2.1. Input 1

Epsilon   String    MSeconds
1         S8        1051
0.9       S8        360
0.8       S4, S8    370
0.7       S4, S8    340
0.6       S4        350
0.5       S4, S2    290
0.4       S4, S8    320



Table 4.2.2. Input 1

Epsilon   String            MSeconds
1         S8, S4            731
0.9       S8, S4            450
0.8       S2, S4, S8        411
0.7       S8, S4            410
0.6       S4, S2, S7        401
0.5       S1, S2, S4, S8    420
0.4       S2, S1, S4        360

As we can see from the results in Table 4.2.1, for ε = 1.0 it takes 1051 ms to
determine the sequence with the lowest distance matrix score. The resulting
sequence is S8 and can be used as a base case. For ε < 1.0, we see a dramatic
reduction in the time it takes to run the algorithm. Note also that for most
values of ε, sequence S8 is the best or among the top two choices. Since we
already know that S8 has the best score when entire sequences are used for
computing the distance matrix, it is encouraging to find the same sequence as
the “winner” for values of ε < 1.0. Table 4.2.2 shows results for the same input
dataset when algorithm 3.3 was used. Although for most values of ε the time
taken by algorithm 3.3 is a little higher than that of algorithm 3.2, it is
still significantly lower than that for ε = 1.0. Also, in Table 4.2.2, S8 and S4
appear as the winning or second winning sequences. The scores of the top choice
and the second best choice are very close to each other, and thus both can be
seen as “equally” good centers for a star alignment.

Dataset Input2 was created by taking the first 300 nucleotides of a DNA sequence
from Homo sapiens PAC clone RP5-943F2 from 7 [18] and using BLAST [17] to
find the closest matching DNA sequences. The top 6 sequences from this BLAST
search were taken to create Input2. Testing algorithm 3.2 (Table 4.2.1b) and
algorithm 3.3 (Table 4.2.2b) on this input file gave the following results.



Table 4.2.1b. Input 2

Epsilon   String        MSeconds   Scores
1         S3, S5        1001       -911, -951
0.9       S3, S5        580        -457, -484
0.8       S3, S5        501        -230, -245
0.7       S5, S3        510        -110, -117
0.6       S3, S2, S5    420        -70, -70, -74
0.5       S3, S6        440        -26, -34
0.4       S5, S1, S3    400        -22, -22, -24



The results show that S3 is a clear winner in Table 4.2.1b. The results are less
conclusive for Table 4.2.2b, but still point towards S3 as the winner. As we can
see, the running time of both algorithms is greatly reduced as the value of ε
decreases. We would need to find a way to determine which sequence is the
ultimate winner when there is more than one probable winner.



Table 4.2.2b. Input 2

Epsilon   String        MSeconds   Scores
1         S3, S5        1061       -884, -928
0.9       S3, S5        611        -452, -452
0.8       S3, S5        401        -225, -230
0.7       S5, S1, S2    421        -124, -131, -132
0.6       S5, S4, S3    380        -62, -62, -72
0.5       S3, S6        270        -37, -41
0.4       S4, S5        350        -24, -26



Dataset Input3 was created by taking a DNA sequence and creating the remaining
sequences from that sequence. As explained in Section 4.1 (2), each character is
inherited unchanged by the new sequence with a probability of 0.94 and changes
into some other character with a probability of 0.06. This creates a set of
sequences that are very similar to each other but still differ slightly because
of the 6% probability of change. There are 21 sequences in this file. Testing
algorithm 3.2 (Table 4.2.3) and algorithm 3.3 (Table 4.2.4) on this input file
gave the following results.



Table 4.2.3. Input 3 P(0.94)

Epsilon   String   MSeconds   Scores
1         S1       7671       5864
0.9       S1       3225       3260
0.8       S1       2794       1852
0.7       S1       2223       1054
0.6       S1       2153       556
0.5       S1       2093       336


Table 4.2.4. Input 3 P(0.94)

Epsilon   String     MSeconds   Scores
1         S1         6059       5868
0.9       S1         2384       3282
0.8       S1         1732       1850
0.7       S1         1582       1002
0.6       S1         1512       596
0.5       S18, S1    1522       332, 306
0.4       S1, S17    1523       194, 194



These results clearly show that the best sequence chosen by both algorithms is
sequence S1. In this dataset, S1 is the winner for an obvious reason: all
sequences are created from a base sequence, which in this case is S1, so S1 is
the best matching sequence to all other sequences. Both Table 4.2.3 and
Table 4.2.4 confirm this logic.

Dataset Input4 is a slight variation of Input3. The significance of Input3 was
that each sequence was created from one base sequence, and thus represented
genetic changes from one species into various descendant species. In Input4,
each sequence is created by modifying the sequence previous to it in the same
way as explained for Input3. Each character stays the same with a probability of
0.99 and changes to a different character with a probability of 0.01. In Input4,
we have created 21 sequences, each from the sequence above it. Testing algorithm
3.2 (Table 4.2.5) and algorithm 3.3 (Table 4.2.6) on this input file gave the
following results.



Table 4.2.5. Input 4 P(0.99)

Epsilon   String           MSeconds   Scores
1         S12, S10         6409       3214, 3180
0.9       S10, S12         2253       1762, 1751
0.8       S12, S13, S10    1773       1054, 1004, 998
0.7       S10, S12         1632       578, 550
0.6       S11, S9, S13     1542       321, 311, 311
0.5       S13, S12         1462       215, 198

Table 4.2.6. Input 4 P(0.99)

Epsilon   String      MSeconds   Scores
1         S10, S9     8232       3081, 3045
0.9       S12, S10    3215       1790, 1773
0.8       S10, S12    3205       1001, 977
0.7       S9, S13     2924       559, 550
0.6       S11, S12    2323       353, 325
0.5       S13, S12    2414       209, 200
0.4       S12, S11    2194       117, 115



The results show a less clear “winner” for Input4. Looking carefully at the
results, we can see that in Table 4.2.5, S12 is either the top or the second
best choice for all values of ε except 0.6. Another interesting observation is
that the selected sequences are from the “middle” part of the sequence
collection. In other words, out of 21 sequences, S10-S12 are winners more often
than any others. This result makes sense because of the way the sequences are
created for Input4. Since each sequence is created from the sequence above it,
the sequences in the middle (S10-S12) have characteristics of all of the
preceding sequences as well as influence over all the following sequences. For
this reason, the sequences in the middle are in the best position to be a
“winner”. This configuration of sequences represents the transformation of a
particular species over many generations. Every generation takes the sequence
from its parent and adds certain modifications to create a new set of gene
sequences.
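
The "middle wins" effect can be reproduced with an idealized, deterministic
chain. The sketch below is an illustration under simplifying assumptions, not a
rerun of the experiment: each sequence differs from its predecessor in exactly
one new position, and Hamming distance stands in for the alignment score. The
function names and the sequence length of 30 are hypothetical choices.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def chain(n, length):
    """Build n sequences; sequence k changes one new position of k-1."""
    seqs = ["A" * length]
    for k in range(1, n):
        prev = seqs[-1]
        seqs.append(prev[:k - 1] + "C" + prev[k:])  # change position k-1
    return seqs


seqs = chain(21, 30)
sums = [sum(hamming(s, t) for t in seqs) for s in seqs]
center = sums.index(min(sums)) + 1  # 1-based, matching S1..S21
```

In this idealized chain the distance between sequences i and j is |i - j|, so
the row sum is smallest for the median sequence, S11. With random mutations and
sampled alignments the winner scatters around that position, which is
consistent with S10-S12 dominating the tables.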

Dataset Input5 was created by taking 7 random sequences. Each sequence is
completely random and does not share any similarity with any other sequence.
Testing algorithm 3.2 (Table 4.2.7) and algorithm 3.3 (Table 4.2.8) on this
input file gave the following results.



Table 4.2.7. Input 5, 7 Sequences

Epsilon   String        MSeconds   Scores
1         S7, S6        1682       -473, -488
0.9       S2, S7        621        -290, -291
0.8       S7, S2        291        -150, -154
0.7       S7, S5        230        -96, -110
0.6       S5, S3        210        -52, -64
0.5       S6, S5, S7    190        -51, -53, -54
0.4       S6, S5, S7    210        -22, -32, -34


Table 4.2.8. Input 5, 7 Sequences

Epsilon   String    MSeconds   Scores
1         S4, S5    1773       -436, -439
0.9       S7, S6    641        -212, -281
0.8       S5, S4    330        -154, -160
0.7       S2, S1    330        -91, -99
0.6       S5, S1    250        -56, -72
0.5       S7, S4    380        -35, -37
0.4       S3, S7    261        -21, -25

As we can see from the results in Table 4.2.7, S7 appears among the top choices
for all values of ε except 0.6. We can also see a significant reduction in time
as the value of ε goes down. Algorithm 3.3 does not give conclusive results but
does have S5 as a top choice for several values of ε. Since these are completely
random sequences, they do not necessarily have any similarity with each other
and are thus less likely to give a clear winner for different values of ε. With
actual sequences this will not be the case, since there will be actual
similarities between all sequences.

Dataset Input6 was created by taking the 7 random sequences of Input5, deriving a 
group from each of them, and putting all the groups together into one large input 
file. In other words, Input6 is a cluster of groups. Each group is created by taking 
a random sequence and applying the technique explained for Input3. Each group 
contains 5 sequences, which are created from the random sequence used as its base 
sequence. Testing algorithm 3.2 (Table 4.2.9) and algorithm 3.3 (Table 4.2.10) on 
this input file gave the following results. 



Table 4.2.9. Input 6 (algorithm 3.2) 

Epsilon   String           MSeconds   Scores 
1         S31, S26         53417      -262, -293 
0.9       S21, S31         15772      -293, -307 
0.8       S31, S33, S21    6009       -174, -235, -264 
0.7       S31, S6          5078       -184, -195 
0.6       S4, S31, S21     4797       -144, -159, -171 
0.5       S33, S13, S31    4767       -110, -111, -119 
0.4       S20, S27, S10    4807       -71, -73, -75 



Table 4.2.10. Input 6 (algorithm 3.3) 

Epsilon   String          MSeconds   Scores 
1         S16, S21        46116      -90, -106 
0.9       S21, S31        17585      -177, -211 
0.8       S3, S23         8082       -231, -257 
0.7       S6, S19, S21    7410       -202, -203, -203 
0.6       S6, S8, S31     5899       -145, -151, -157 
0.5       S3, S25, S16    5388       -111, -116, -119 
0.4       S2, S7          5438       -57, -60 



We have 35 sequences in this example. In Table 4.2.9, we can see that S31 is the 
most probable winner for nearly all values of epsilon (it is absent only for 
epsilon = 0.4). S31 belongs to the 7th group, and it is the “base sequence” from 
which the other 4 sequences of that group are created. This result confirms what we 
saw in Table 4.2.7, where S7 was the most probable winner. Table 4.2.10 does not have 
a clear winner, but its winners are either sequences that were winners in Table 4.2.8 
or sequences belonging to the groups derived from those winners. 



5 Discussion 

In this paper, we have discussed various methods of Multiple Sequence Alignment 
(MSA). We have also introduced a new approach that randomly samples sequences and 
aligns the samples, which yields nearly the same distance matrix as a full alignment 
while significantly improving the runtime. Furthermore, we have extended the 
randomization approach to sequences of non-uniform length and have taken a segmented 
approach to improve accuracy. We have backed up our claims of speedup and accuracy 
with empirical data and examples. As can be seen from the results, our algorithms 
worked well on various types of data sets. Randomization can be used to significantly 
improve MSA running time while keeping the results fairly accurate. 



6 Future Work 

We plan to make several critical improvements to our algorithm. First of all, we 
would like to analyze the sampling process under models other than the ones assumed 
in this paper. As some of the results show, we also have to find a way to accurately 
determine the most probable winning sequence when more than one sequence is close to 
being the winner. In order to select a single winning sequence among several probable 
candidates, we can perform a full alignment of these candidates and determine which 
one is the best match. Although full alignment is time consuming, doing it for only a 
small number of sequences should give much better results. It is also possible to 
take this work further and implement randomized portions for CLUSTAL W, MAFFT and 
other popular MSA packages in order to increase their speed. In our opinion, further 
speedup can be achieved by randomizing not just pairwise alignment but also sequence 
selection, but this hypothesis still needs further work. 

Abstract. In this paper we report outcomes of our computational anal- 
ysis applied to time-series gene expression data generated by Kagami 
et al. [7]. Gene expression data were generated using Affymetrix chips 
and validated by quantitative RT-PCR (reverse transcription-polymerase 
chain reaction) expression analysis of 12 randomly selected and differen- 
tially expressed genes. The biotin-labelled cRNA samples generated from 
mouse cerebella samples were collected at five developmental stages: 1 
prenatal (embryonic day 18 or E18) and 4 postnatal at 7, 14, 21 and 56 
days (P7, P14, P21 and P56). 

We propose APRIORI-GST, an APRIORI-like algorithm that uses a 
GST index for discovering sequential patterns from microarray data. 
From the extracted patterns we outline the hypothesis that there is a 
lot of gene activity between the prenatal stage E18 and postnatal stage 
P7, which needs to be further investigated. 

Keywords: Sequential pattern mining, Suffix Tree, Apriori, Affymetrix 
GeneChip, Differential expression, DNA chip, DNA microarray, Expres- 
sion analysis, Gene expression. 



1 Introduction 

Given the advent of microarray technology, it is now possible to analyze the ex- 
pression of a large number of genes simultaneously [10]. Microarray experiments 
can be classified according to the nature of the samples, i.e. time of collection, 
location, type of tissue, class of tumor, etc. In the present paper we are interested 
in exploring our computational methods when applied to time series microar- 
ray experiments. In particular we report results applied to gene expression time 
series associated to Mouse Cerebellum development [7, 8, 17]. 

* Jesus Lopez was at the University of Ulster when this research was done. 



J.A. Lopez et al. (Eds.): KELSI 2004, LNAI 3303, pp. 46-57, 2004. 
© Springer-Verlag Berlin Heidelberg 2004 

1.1 Biological Motivation and Gene Expression Data Generation 

In this paper we report outcomes of our computational analysis applied to time- 
series gene expression data generated by Kagami et al. [7]. The data is publicly 
available through [1]. In that study, Kagami et al. [7] investigated differentially 
expressed genes during the development of the mouse cerebellum. Their biological 
interest was focused on further understanding the molecular basis of mouse 
cerebellum development. The mouse cerebellum is not entirely developed until 
postnatal day 21; therefore their experiment was an ideal framework for 
understanding the genetic foundations and mechanisms of neural development [11-13]. 

Gene expression data were generated using Affymetrix chips and validated 
by quantitative RT-PCR expression analysis of 12 randomly selected and 
differentially expressed genes. Samples were independently hybridized in duplicate 
with 25-mer oligonucleotide sequences representing 12,654 genes on Mu11K 
GeneChips. 

The biotin-labelled cRNA samples generated from mouse cerebella samples 
were collected at five developmental stages: 1 prenatal (embryonic day 18 or 
E18) and 4 postnatal at 7, 14, 21 and 56 days (P7, P14, P21 and P56). The 
postnatal morphological and neurological development of the mouse cerebellum 
has been studied and reported [3,5]. 

1.2 Suffix Trees and Gene Data 

Suffix Trees are widely used in computational biology, especially for genome 
alignment [6, 9]. As far as we know, our approach of using an algorithm based on 
Generalized Suffix Trees (GST) and Apriori for microarray data is original 
in this domain. We successfully applied this algorithm to Web Usage 
data and the results are reported in [15]. 

2 Method 

Sequential pattern mining is an advanced data mining technique that extracts 
frequently occurring patterns from given sequences. As an example, from a dataset 
for market basket analysis, one could find a pattern like “Customers who buy 
digital cameras will later buy memory cards and then photo printers”. 

With proper pre-processing, this kind of pattern can also be extracted 
from microarray data. In this section we detail our method for extracting 
sequential patterns from microarray data. 

2.1 Sequential Pattern Mining 

The problem of finding sequential patterns was derived from association rule 
mining in [14]. 

In [2], the association rules mining problem is defined as follows: 

Definition 1. Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of $m$ literals (items). 
Let $D = \{t_1, t_2, \ldots, t_n\}$ be a set of $n$ transactions; associated with 
each transaction is a unique identifier, called TID, and an itemset $I$. $I$ is a 
$k$-itemset where $k$ is the number of items in $I$. We say that a transaction $T$ 
contains $X$, a set of some items in $I$, if $X \subseteq T$. The support of an 
itemset $I$ is the fraction of transactions in $D$ containing $I$: 
$supp(I) = \|\{t \in D \mid I \subseteq t\}\| / \|\{t \in D\}\|$. An association 
rule is an implication of the form $I_1 \Rightarrow I_2$, where 
$I_1, I_2 \subset I$ and $I_1 \cap I_2 = \emptyset$. The rule $I_1 \Rightarrow I_2$ 
holds in the transaction set $D$ with confidence $c$ if $c\%$ of transactions in $D$ 
that contain $I_1$ also contain $I_2$. The rule $r : I_1 \Rightarrow I_2$ has 
support $s$ in the transaction set $D$ if $s\%$ of transactions in $D$ contain 
$I_1 \cup I_2$ (i.e. $supp(r) = supp(I_1 \cup I_2)$). 

Given two parameters specified by the user, minsupp and minconfidence, the 
problem of association rule mining in a database D aims at providing the set of 
frequent itemsets in D, i.e. all the itemsets having support greater than or equal 
to minsupp. Association rules with confidence greater than minconfidence are then 
generated from these itemsets. 
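
The support and confidence of Definition 1 can be illustrated with a minimal Python sketch. The function names and the toy transaction database below are hypothetical and chosen only for illustration; they are not taken from [2].

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def supp(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(lhs: Iterable[str], rhs: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Among the transactions containing `lhs`, the fraction also containing `rhs`."""
    base = supp(lhs, transactions)
    if base == 0:
        return 0.0
    return supp(set(lhs) | set(rhs), transactions) / base


# Hypothetical toy database D of four transactions (illustration only).
D = [frozenset(t) for t in (
    {"camera", "card"},
    {"camera", "card", "printer"},
    {"camera"},
    {"card"},
)]

print(supp({"camera", "card"}, D))          # 0.5
print(confidence({"camera"}, {"card"}, D))  # about 0.667
```

With minsupp = 0.5, the itemset {camera, card} would count as frequent under the "greater than or equal" convention stated above.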

As this definition does not take time into consideration, the sequential patterns 
are defined in [14]: 

Definition 2. A sequence is an ordered list of itemsets denoted by 
$\langle s_1 s_2 \ldots s_n \rangle$ where $s_j$ is an itemset. The data-sequence of 
a customer $c$ is the sequence in $D$ corresponding to customer $c$. A sequence 
$\langle a_1 a_2 \ldots a_n \rangle$ is a subsequence of another sequence 
$\langle b_1 b_2 \ldots b_m \rangle$ if there exist integers 
$i_1 < i_2 < \ldots < i_n$ such that 
$a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_n \subseteq b_{i_n}$. 

Example 1. Let C be a client and S=< (3) (4 5) (8) >, be that client’s purchases. 
S means that “C bought item 3, then he or she bought 4 and 5 at the same 
moment (i.e. in the same transaction) and finally bought item 8”. 

Definition 3. The support for a sequence S, also called supp(S), is defined as 
the fraction of total data-sequences that contain S. If supp(S) > minsupp, with a 
minimum support value minsupp given by the user, S is considered as a frequent 
sequential pattern. 
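
Definitions 2 and 3 can be sketched as follows. This is an illustrative Python reading of the definitions, not code from [14]; the helper names and the toy data-sequences (other than the purchases of Example 1) are assumptions.

```python
def is_subsequence(a, b):
    """True if the itemsets of `a` are included, in order, in distinct itemsets of `b`."""
    pos = 0
    for itemset in a:
        # Greedily take the earliest itemset of b that includes the current one.
        while pos < len(b) and not set(itemset) <= set(b[pos]):
            pos += 1
        if pos == len(b):
            return False
        pos += 1
    return True


def sequence_support(pattern, data_sequences):
    """Fraction of data-sequences that contain `pattern` as a subsequence."""
    hits = sum(1 for s in data_sequences if is_subsequence(pattern, s))
    return hits / len(data_sequences)


# Purchases of Example 1: item 3, then items 4 and 5 together, then item 8.
purchases = [{3}, {4, 5}, {8}]

# Hypothetical database of three data-sequences (illustration only).
data_sequences = [purchases, [{3}, {8}], [{4}, {5}]]

print(is_subsequence([{3}, {5}], purchases))       # True
print(is_subsequence([{5}, {3}], purchases))       # False: order matters
print(sequence_support([{3}, {8}], data_sequences))
```

The greedy scan is sufficient here because matching each itemset as early as possible never prevents a later itemset from being matched.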



2.2 Encoding the Microarray Data 

The original dataset taken from [7] contains 5 series of normalised expressed 
values for 897 genes. We considered this dataset as 897 sequences (time ordered), 
so we have one sequence for each gene Gi. 

Thus we can express a gene as 
$G_i = \langle G_i(t_1), G_i(t_2), G_i(t_3), G_i(t_4), G_i(t_5) \rangle$, 
each $G_i(t_j)$ representing the normalised expression value for gene $G_i$ at time 
$t_j$. 

Next we replaced the normalised values with their log values: 
$Log(G_i) = \langle \log(G_i(t_1)), \log(G_i(t_2)), \ldots, \log(G_i(t_5)) \rangle$. 

We computed the mean (Mean) and standard deviation (STD) for the gene 
expression values of the initial 5 series: 

$Mean(t_j) = Mean(\log(G_i(t_j)))$, taken over $i = 1..897$, for each $j = 1..5$ 
$STD(t_j) = STD(\log(G_i(t_j)))$, taken over $i = 1..897$, for each $j = 1..5$ 

We computed the ZScore as follows: 

$ZScore(G_i(t_j)) = (\log(G_i(t_j)) - Mean(t_j)) / STD(t_j)$, for each 
$i = 1..897$, for each $j = 1..5$ 

Finally, each $G_i(t_j)$ was given one of the three discretisation values 
($e^+$: expressed at high level, $e^0$: expressed at medium level, or $e^-$: 
expressed at low level) according to its ZScore: 

if $ZScore(G_i(t_j)) > 1.96$ then $G_i(t_j) = e^+$ 
else if $ZScore(G_i(t_j)) < -1.96$ then $G_i(t_j) = e^-$ 
else $G_i(t_j) = e^0$. 
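
The encoding steps of this section can be sketched in Python as below. This is a minimal illustration under stated assumptions: the text does not say which logarithm base or which standard deviation estimator was used, so the natural logarithm and the sample standard deviation are assumed here, and the function name and toy matrix are hypothetical.

```python
import math
import statistics


def discretise(expression, threshold=1.96):
    """Encode a genes x time-points matrix as 'e+', 'e0' or 'e-' labels.

    Assumptions (not stated in the text): natural logarithm and sample
    standard deviation, both computed per time point over all genes.
    """
    logs = [[math.log(value) for value in row] for row in expression]
    n_times = len(logs[0])
    columns = [[row[j] for row in logs] for j in range(n_times)]
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.stdev(col) for col in columns]

    labels = []
    for row in logs:
        encoded = []
        for j, value in enumerate(row):
            z = (value - means[j]) / stds[j]
            if z > threshold:
                encoded.append("e+")
            elif z < -threshold:
                encoded.append("e-")
            else:
                encoded.append("e0")
        labels.append(encoded)
    return labels


# Hypothetical toy matrix: 10 genes, 2 time points, two outlying values.
toy = [[100.0, 100.0] for _ in range(10)]
toy[0][0] = 10000.0  # gene 0 is unusually high at the first time point
toy[1][1] = 1.0      # gene 1 is unusually low at the second time point

encoded = discretise(toy)
print(encoded[0])  # ['e+', 'e0']
print(encoded[1])  # ['e0', 'e-']
```

Because the threshold of 1.96 is applied to a Z-score computed across genes, only values far from the per-stage mean are labelled $e^+$ or $e^-$; most values fall into $e^0$.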

2.3 Indexing the Gene Regulatory Expressions 

In order to index the gene regulatory expressions (GREs), we use the GST 
index [6] as described in the next two paragraphs. 

Suffix Tree Index. Basically, a Suffix Tree (ST) is a data structure used for 
text indexing. Such an index is mainly used to search for a sub-string in linear 
time. This search is made possible by an initial treatment of the text which is 
also realised in linear time. We will give a more complete definition as follows. 

Definition 4. Let $S$ be a string, $S = x_1 x_2 \ldots x_n$. A suffix of $S$ is 
$S[i, n] = x_i x_{i+1} \ldots x_n$. The Suffix Tree $T$ for $S$ is a tree with $n$ 
leaves such that: 

• there exists a one-to-one relationship between the leaves of $T$ and the suffixes 
$S[i, n]$ of $S$; the leaf corresponding to $S[i, n]$ is labelled with $i$. 
• the edges are labelled with non-empty words. 
• the degree of its internal nodes is $> 1$. 
• for a particular node, all children’s labels begin with a different letter. 
• the concatenation of the edges’ labels on the path from the root node to a leaf 
$i$ forms the suffix $S[i, n]$ of $S$. 

Hypothesis: No suffix is a prefix of another suffix [4]. We have taken from [6] 
the example of the ST for the string xabxac (cf. Fig. 1). The path from the root 
to the leaf number 1 spells exactly $S[1, n]$ = xabxac. The path from the root 
to the leaf number 5 spells $S[5, n]$ = ac. 




Fig. 1. A suffix tree for the string xabxac [6] 

Without the final hypothesis, the definition of the suffix tree does not guar- 
antee that a corresponding suffix tree exists for each string. For example, 
if we consider the string "xabxa", the suffix "xa" is also a prefix of another 
suffix (the whole string "xabxa"), so its path would end inside an edge rather 
than at a leaf. We therefore cannot build a suffix tree for this string according 
to the definition given above. In order to solve this problem, we add a new 
character, generically noted $, at the end of the string. This guarantees that no 
suffix is a prefix of another suffix. 
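
The effect of the terminal character can be checked directly. The following Python sketch uses a hypothetical helper, `has_nested_suffix`, which tests whether some suffix of a string is a prefix of another suffix; it only illustrates the hypothesis and does not build a suffix tree.

```python
def suffixes(s):
    """All non-empty suffixes S[i, n] of the string s."""
    return [s[i:] for i in range(len(s))]


def has_nested_suffix(s):
    """True if some suffix of s is a prefix of another, different suffix."""
    sfx = suffixes(s)
    return any(a != b and b.startswith(a) for a in sfx for b in sfx)


print(has_nested_suffix("xabxa"))    # True: "xa" is a prefix of "xabxa"
print(has_nested_suffix("xabxa$"))   # False: the terminator removes the problem
print(has_nested_suffix("xabxac"))   # False: the example string of Fig. 1
```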

There are several algorithms for building a suffix tree in O(n) time [6]. We 
use the Ukkonen algorithm [16] for its good time and memory-occupation perfor- 
mances. For this algorithm the construction of the ST is incremental and counts 
n steps, one for each suffix S[i, n] of S. For space reasons, we do not give here 
the description of the algorithm, but the interested reader can refer to [6] for a 
complete and rigorous description of the method. 

Generalised Suffix Tree Index. As mentioned at the beginning of this section (and 
in Section 1.2), we use a GST index [6]. Until now we have presented the 
ST index for a single string. A GST index is an ST index for at least two strings. 
Notations: Before presenting the algorithm used to build a GST, let us intro- 
duce some notations that we are going to use in this paper: 

• $N$ is the total number of strings to be indexed. 
• $S_i$ ($i = 1, \ldots, N$) represents the $i$-th string. 
• $T(i)$ is the GST for the set $\{S_j \mid 0 < j < i + 1\}$. 
• $R(T(i))$ is the root node of $T(i)$. 
• $v(i, e)$ is an internal node of the tree (other than $R(T(i))$). $i$ is the number 
of the string which was being indexed when the node was created. The edge of the 
node $v$, noted $e$, is represented by the pair $(e_1, e_2)$, where $e_1$ is the 
index where the label of the edge $e$ begins in the string $S_i$, and $e_2$ the 
index where $e$ ends. 
• $l(i, e, P_l)$ is a leaf of the tree; $i$ and $e$ have the same meaning as for an 
internal node. $P_l = \{(i_1, j_1), \ldots, (i_k, j_k)\}$ is the set of the suffixes 
represented by this leaf. We say that a suffix $S_i[j, n]$ is represented by the 
leaf $l$ if the factor¹ of the node $l$ exactly matches $S_i[j, n]$. In this case 
the pair $(i, j)$ belongs to $P_l$. 

Using these notations, the algorithm for building a GST index consists of the 
following two steps: 

Step 1: we build $T(1)$, the ST for $S_1$, using the Ukkonen algorithm. 

Step 2: for any string $S_i$, $0 < i < N + 1$, we traverse the current GST 
following the path $S_i$ as far as possible. Let us suppose that we arrive at 
position $j$ in $S_i$ (this means that the first $j - 1$ characters of $S_i$ are 
contained in the current GST). We obtain $T(i)$ by applying the Ukkonen 
algorithm from step $j$ until all the suffixes $S_i[p..n]$, $p > j - 1$, are added 
to $T(i - 1)$. 

1 The factor of a node v is the string formed by the concatenation of all the labels of 
the edges in the path from R(T(i)) to v. 

It is possible that several strings have a common suffix. In this case, the corre- 
sponding leaf $l(i, e, P_l)$ will contain this information in the set $P_l$. The set 
$P_l$ of $l(i, e, P_l)$ contains all the pairs $(i, j_i)$ ($j_i$ represents the 
beginning of the common suffix in $S_i$). 

In Fig. 2 we have an example of the GST obtained by adding the string 
“babxba” to the ST for “xabxa”. 
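
The role of the sets $P_l$ can be illustrated with a deliberately naive Python sketch. It builds an uncompressed generalised suffix trie (one character per edge, quadratic construction) rather than the compact GST obtained with the Ukkonen algorithm, and it assumes a single shared terminator "$"; class and function names are hypothetical.

```python
class Node:
    """Trie node; `pairs` plays the role of P_l and is filled only at leaves."""

    def __init__(self):
        self.children = {}
        self.pairs = set()  # pairs (string number i, start position j), 1-based


def build_generalised_suffix_trie(strings, terminator="$"):
    """Insert every suffix of every string, each followed by the terminator."""
    root = Node()
    for i, s in enumerate(strings, start=1):
        text = s + terminator
        for j in range(1, len(s) + 1):
            node = root
            for ch in text[j - 1:]:
                node = node.children.setdefault(ch, Node())
            node.pairs.add((i, j))  # the leaf reached represents S_i[j, n]
    return root


def leaf_pairs(root, suffix, terminator="$"):
    """Return P_l for the leaf whose factor is `suffix`, or an empty set."""
    node = root
    for ch in suffix + terminator:
        node = node.children.get(ch)
        if node is None:
            return set()
    return node.pairs


# The two strings used in the text for Fig. 2.
root = build_generalised_suffix_trie(["xabxa", "babxba"])
print(leaf_pairs(root, "a"))    # the suffix "a" is shared by both strings
print(leaf_pairs(root, "bxa"))  # only in the first string
```

For these two strings the only common suffix is "a", which starts at position 5 of the first string and at position 6 of the second, so its leaf holds two pairs.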




2.4 Sequential Patterns Discovery in Microarray Data 

We propose APRIORI-GST, an APRIORI-like algorithm that uses a GST index 
for discovering sequential patterns from microarray data. The microarray data 
is transformed into sequences of three possible levels of exposure ($e^+$, $e^0$ or 
$e^-$, cf. section 2.2). These sequences are indexed using a GST index (cf. section 
2.3). A microarray sequential pattern may be seen, in this case, as a frequently 
occurring sub-sequence of exposure levels. We apply the APRIORI-GST 
algorithm, described below, to discover such sequences. 

Recall that the support for a sequence S is defined as the ratio between 
the number of sequences containing S and the total number of sequences. In our 
method, the minimum support minsupp is the only parameter the user inputs. 

In this new context a sequence is defined as follows: let $I = \{e^+, e^0, e^-\}$ be 
the set of the three possible exposure levels. A sequence $S$ for a gene $g$ is a 
list of items (exposure levels) ordered by their time stamp and noted 
$S = (s_1 s_2 \ldots s_n)$. A $k$-sequence is a sequence of $k$ items. A 
$k$-sequence $S$ is a frequent sequence if the support of $S$ is bigger than 
minsupp, the minimum support. We say that the sequence $S$ is a sub-sequence of 
another sequence $S' = (s'_1 s'_2 \ldots s'_m)$ (with $n \leq m$) if there are two 
positive integers $j, k$ (with $n = k - j + 1$) s.t. 
$s_1 = s'_j, s_2 = s'_{j+1}, \ldots, s_n = s'_k$. In order to determine the frequent 
sequences, the APRIORI-GST algorithm tests, at each step $k$, all the $k$-sequences 
from the set $C_k$ (candidate $k$-sequences). This set is filtered using the minimum 
support minsupp and we obtain the set of frequent $k$-sequences, $L_k$ (sequences 
having the support > minsupp). A join on $L_k$ gives the $C_{k+1}$ set used in the 
next step. The initial $C_1$ set is formed by all the items of $I$ and the algorithm 
stops at step $k$, when $C_k$ is empty. As we can see, the test for determining the 
support of a $k$-sequence is done very often. Below we give the recursive function 
supp that we use to calculate the support for a $k$-sequence. 

Function supp(S, v) 
// S = (s_1 s_2 ... s_n) a k-sequence, v = (i, e, child, P_v) a node of the GST 
if (v is null) then return 0; 
else 
    if (S is empty) then return v.ds / N; 
    else 
        nextNode = v.child(s_1); 
        return supp(suffix(S, nextNode), nextNode); 
// Output: the sequence’s support (a real value in [0,1]). 
end supp; 

Let us note that v.child($s_1$) is the child of v which is introduced by the edge 
that begins with $s_1$². The function suffix(S, nextNode) removes from the beginning 
of S the items of the edge between v and nextNode. v.ds gives the total 
number of distinct sequences found in all the sets $P_l$, where $l$ is a leaf of the 
subtree that has v as its root. v.ds is calculated for each node after the 
construction of the GST, and its value is updated when new sequences are added. 
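
The recursive support computation can be sketched in Python on a simplified index. The sketch below is not the GREPminer implementation: it assumes an uncompressed suffix trie in which every edge carries exactly one item, so that suffix(S, nextNode) reduces to dropping the first item of S, and it stores at each node the set of sequence identifiers from which v.ds is derived. All names and the toy data are hypothetical.

```python
class Node:
    """Trie node storing the ids of the sequences whose suffixes pass through it."""

    def __init__(self):
        self.children = {}
        self.seq_ids = set()

    @property
    def ds(self):
        """Number of distinct sequences represented below this node."""
        return len(self.seq_ids)


def build_index(sequences):
    """Insert every suffix of every sequence, one item per edge."""
    root = Node()
    for sid, seq in enumerate(sequences):
        root.seq_ids.add(sid)
        for start in range(len(seq)):
            node = root
            for item in seq[start:]:
                node = node.children.setdefault(item, Node())
                node.seq_ids.add(sid)
    return root


def supp(pattern, node, n):
    """Support of a contiguous pattern: follow it from `node`, then read ds / n."""
    if node is None:
        return 0.0
    if not pattern:
        return node.ds / n
    return supp(pattern[1:], node.children.get(pattern[0]), n)


# Hypothetical encoded gene sequences (illustration only).
genes = [
    ["e0", "e+", "e0", "e0", "e0"],
    ["e0", "e+", "e0", "e0", "e0"],
    ["e0", "e-", "e0", "e0", "e0"],
    ["e0", "e0", "e0", "e0", "e0"],
]
index = build_index(genes)
print(supp(["e0", "e+"], index, len(genes)))        # 0.5
print(supp(["e0", "e0", "e0"], index, len(genes)))  # 1.0
print(supp(["e+", "e-"], index, len(genes)))        # 0.0
```

Once the index is built, each support query only walks as many nodes as the pattern has items, without rescanning the sequences.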

Originality of Our APRIORI-GST Algorithm: The algorithm has two 
major differences from the classical APRIORI algorithm. First, to calculate 
the support of a k-sequence we use the supp function. Second, in the APRIORI 
iteration, when we calculate $L_k$, we directly compute the support for all 
candidates c of $C_k$ without checking them against all the GREs S from the 
database. This implies fewer disk accesses for reading the sequences, because the 
GST is kept in main memory. 

Procedure APRIORI-GST(S, I, minsupp, T(N)) 
// Input: 
// S the set of N sequences, I the set of items, 
// minsupp the minimum support, T(N) the GST 
k = 1; C_1 = I; 
while (C_k ≠ ∅) do 
    for each (c ∈ C_k) do 
        if (supp(c, R(T)) > minsupp) 
            then L_k = L_k ∪ {c}; 
    k = k + 1; 
    GenerateCandidate(C_k, L_{k-1}); 
end while; 
return L = ∪_{j=1}^{k-1} L_j; 
// Output: the set L of frequent sequences 
end APRIORI-GST; 
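
The level-wise loop of the procedure can be sketched in Python as follows. To keep the example short, the support is computed here by a direct scan of the sequences instead of a GST lookup, so the sketch illustrates only the candidate generation and filtering, not the indexing that distinguishes APRIORI-GST. The join rule (two frequent k-sequences are joined when the last k-1 items of the first equal the first k-1 items of the second) is an assumption, since GenerateCandidate is not detailed in the text; all names and data are hypothetical.

```python
def contains(seq, pattern):
    """True if `pattern` occurs in `seq` as a contiguous sub-sequence."""
    k = len(pattern)
    return any(tuple(seq[i:i + k]) == pattern for i in range(len(seq) - k + 1))


def support(pattern, sequences):
    """Fraction of sequences containing `pattern` (direct scan, no index)."""
    return sum(1 for s in sequences if contains(s, pattern)) / len(sequences)


def generate_candidates(frequent_k):
    """Join step (assumed rule): overlap of k-1 items gives a (k+1)-sequence."""
    return {a + (b[-1],) for a in frequent_k for b in frequent_k
            if a[1:] == b[:-1]}


def apriori_contiguous(sequences, items, minsupp):
    """Return {pattern: support} for all patterns with support > minsupp."""
    candidates = {(item,) for item in items}  # C_1
    frequent = {}
    while candidates:
        level = {c: support(c, sequences) for c in candidates}
        level = {c: s for c, s in level.items() if s > minsupp}  # L_k
        frequent.update(level)
        candidates = generate_candidates(set(level))  # C_{k+1}
    return frequent


# Hypothetical encoded sequences (illustration only).
genes = [["e0", "e+", "e0", "e0", "e0"]] * 3 + [["e0", "e-", "e0", "e0", "e0"]]
result = apriori_contiguous(genes, ["e+", "e0", "e-"], minsupp=0.5)
for pattern in sorted(result, key=lambda p: (len(p), p)):
    print(" ".join(pattern), result[pattern])
```

The loop stops on its own: once no candidate passes the threshold, the join yields an empty candidate set.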

2 We recall that for an internal node in a ST all children’s edges begin with a different 
item. 

[Figure 3: screenshot of the GREPminer tool, with the input file and support 
fields at the top. The left panel lists the extracted frequent sequences with their 
support; the right panel lists the genes (ID, GB_ACC, DESCRIPTION, FUNCTION, 
SUBFUNCTION) supporting the selected pattern.] 

Fig. 3. The GREPminer tool implementing the Apriori-GST algorithm 



3 Results 

To support our methodology, we designed and implemented in Java the 
GREPminer³ tool presented in Fig. 3. The user chooses a dataset file and extracts 
sequential patterns having a support above a specified threshold. The 
extracted frequent sequential patterns are listed on the left side and the details 
(list of genes) for the selected pattern are displayed on the right side. 

We used a temporal dataset described in [7], taken from the GEO repository [1]. 
The dataset consisted of 5 series of expression levels for 897 genes. The 
gene list comprises only those genes that Kagami et al. [7] selected as being 
significantly expressed. Their criteria are: showing more than a two-fold 
difference between maximum and minimum intensity, the maximum intensity must 
be > 1000 units, and the differential expression must be seen in both replicates. 

We executed several tests. First we extracted frequent sequential patterns 
having a support above 50% (cf. Table 1). Next, we extracted all the 
patterns from this dataset by specifying a support of at least 1 sequence (i.e. 
0.11%) and we obtained 40 patterns, listed in the table in the Appendix. 

The execution time for our application was less than 1 second with the lowest 
possible support for this dataset (0.11%). 

4 Discussion and Conclusions 

In Table 1 the interesting pattern is P1, which concerns a large number of genes 
(60.31%): these genes are highly expressed at postnatal stage P7 and medium 

3 Gene Regulatory Expression Profiles Miner. 

Table 1. List of all patterns with support > 50% 

PatternID   Pattern             Support   Number of Genes 
P1          [e0 e+ e0 e0 e0]    60.31%    541 
P2          [e+ e0 e0 e0]       60.76%    545 
P3          [e0 e+ e0 e0]       60.42%    542 
P4          [e0 e0 e0]          99.33%    891 
P5          [e0 e+ e0]          60.98%    547 
P6          [e+ e0 e0]          60.87%    546 
P7          [e0 e0]             99.44%    892 
P8          [e+ e0]             61.43%    551 
P9          [e0 e+]             60.98%    547 



Table 2. List of all patterns of length 5 (partition over the dataset) 

PatternID   Pattern             Support (%)   Number of Genes 
P1          [e0 e+ e0 e0 e0]    60.31         541 
P2          [e0 e- e0 e0 e0]    16.05         144 
P3          [e? e- e0 e0 e0]    12.93         116 
P4          [e0 e0 e0 e0 e0]    9.25          83 
P5          [e- e+ e0 e0 e0]    0.45          4 
P6          [e0 e+ e0 e- e-]    0.33          3 
P7          [e0 e+ e0 e- e0]    0.22          2 
P8          [e- e0 e0 e0 e0]    0.22          2 
P9          [e0 e0 e0 e- e0]    0.11          1 
P10         [e0 e+ e0 e0 e?]    0.11          1 



expressed at the other stages. This leads us to the conclusion that there is a lot 
of gene activity around stage P7. 

Table 2, which contains all the patterns of length 5, can be regarded as a 
partition over the set of all genes. Here a number of interesting time patterns can 
be observed. The first two patterns, P1 ([e0 e+ e0 e0 e0]) and P2 ([e0 e- e0 e0 e0]), 
are associated with high support figures, 60.31% and 16.05% respectively. The 
first pattern, which is the most supported, shows that a high number of genes 
are highly expressed at the first observed postnatal stage (P7) but medium 
expressed otherwise. In contrast, the second pattern represents genes that go from 
medium expressed values to low expressed values at P7. In fact, except for two 
patterns (P4 and P9), all the other patterns change their expression values on 
or after the postnatal stage P7. Together, all these patterns represent 90.63% 
of the entire gene list. These genes change their expressed value around stage 
P7 and, except for a minority of 6 genes supporting P6, P7 and P10, they keep a 
medium expressed value for the rest of the postnatal stages. Our hypothesis is 
that there is a lot of gene activity between the prenatal stage E18 and postnatal 
stage P7, and that the authors of the initial study [7] overlooked the intermediary 
stages between E18 and P7 (i.e. P0 and P3). Our belief is that a more detailed 
study including these two stages and also stages between P7 and P14 would 
allow a better classification of the genes. 
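
Because Table 2 partitions the 897 genes, the supports of Table 1 can be recomputed from it as a consistency check. The Python sketch below transcribes the length-5 patterns and gene counts of Table 2 as they are described in the text; the first item of P3 and the last item of P10 are not legible in the table and are represented by a hypothetical placeholder "e?". Neither can be e0 (the pattern would then coincide with P2 or P1), and neither choice of e+ or e- changes the counts checked here.

```python
# Length-5 patterns of Table 2 with their gene counts.
# "e?" marks an item whose level is not legible in the table (placeholder).
PARTITION = {
    ("e0", "e+", "e0", "e0", "e0"): 541,  # P1
    ("e0", "e-", "e0", "e0", "e0"): 144,  # P2
    ("e?", "e-", "e0", "e0", "e0"): 116,  # P3
    ("e0", "e0", "e0", "e0", "e0"): 83,   # P4
    ("e-", "e+", "e0", "e0", "e0"): 4,    # P5
    ("e0", "e+", "e0", "e-", "e-"): 3,    # P6
    ("e0", "e+", "e0", "e-", "e0"): 2,    # P7
    ("e-", "e0", "e0", "e0", "e0"): 2,    # P8
    ("e0", "e0", "e0", "e-", "e0"): 1,    # P9
    ("e0", "e+", "e0", "e0", "e?"): 1,    # P10
}
TOTAL = sum(PARTITION.values())


def contains(seq, pattern):
    """True if `pattern` occurs in `seq` as a contiguous sub-sequence."""
    k = len(pattern)
    return any(seq[i:i + k] == pattern for i in range(len(seq) - k + 1))


def gene_count(pattern):
    """Number of genes whose length-5 sequence contains `pattern`."""
    return sum(n for seq, n in PARTITION.items() if contains(seq, pattern))


for pattern in [("e0", "e0", "e0"), ("e0", "e0"), ("e+", "e0"), ("e0", "e+")]:
    count = gene_count(pattern)
    print(" ".join(pattern), count, round(100 * count / TOTAL, 2))
```

Under this transcription the recomputed gene counts (for example 891 genes for [e0 e0 e0] and 551 genes for [e+ e0]) agree with the figures reported in Table 1.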

The pattern P4 is of little interest, as the 83 genes supporting this pattern 
have a medium expressed value during the 5 observed stages. 

The rest of the patterns (from P5 to P10) represent a small number of genes 
or even a single gene, like P9 and P10. 

The patterns P6 and P7 are supported by genes with highly similar behaviour. 
The difference between the two groups resides in stage P56, where genes from P6 
are expressed at low level and genes from P7 are expressed at medium level. 

The two genes supporting P8 are expressed at low level in the prenatal stage and 
then expressed at medium level for the rest of the stages (highly similar to P4). 

The patterns P9 and P10 are each supported by a single gene, gene 623 and 
gene 671 respectively. We can say that these genes have an “uncommon” 
behaviour. The description of the two genes is given in Table 3. 



Table 3. Details of the two singular genes 

PatternID   ID    GB_ACC   DESCRIPTION                     FUNCTION   SUBFUNCTION 
P9          623   D45208   HPC-1/syntaxin                  IMTN       vesicle 
P10         671   X67668   High mobility group protein 2   NNM 



The pattern P5 is the only pattern containing a stage at low expressed value 
followed by a stage at high expressed value. We list in Table 4 the details of the 
genes supporting this pattern. 



Table 4. Details on the 4 genes supporting the P5 pattern 

ID    GB_ACC     DESCRIPTION                              FUNCTION   SUBFUNCTION 
178   X02801     Glial fibrillary acidic protein (GFAP)   CSC        cytoskeleton 
462   AI838274   3’ end /clone=UI-M-AO0-aby-a-05-0-UI     EST 
619   U48398     Aquaporin 4                              IMTN       transporter 
896   X13986     minopontin                               UC 


We conclude that our study highlighted the essential stages for gene 
expression (activity) in mouse cerebellum development and that these stages 
need further investigation in a more precise study. 

Appendix. List of All Patterns Extracted from the Dataset 

Abstract. SciCraft is a general open source data analysis tool which can 
be used in the analysis of microarrays. The main advantage of SciCraft 
is its ability to integrate different types of software through an intuitive 
and user friendly graphical interface. The user is able to control the flow 
of analysis and visualisation through a visual programming environment 
(VPE) where programs are drawn as diagrams. These diagrams consist of 
nodes and links where the nodes are methods or operators and the links 
are lines showing the flow of data between the nodes. The diagrammatic 
approach used in SciCraft is particularly suited to represent the various 
data analysis pipelines being used in the analysis of microarrays. 
Efficient integration of methods from different computer languages and 
programs is accomplished through various plug-ins that handle all the 
necessary communication and data format handling. Currently available 
plug-ins are Octave (an open source Matlab clone), Python and R. 



1 Introduction 

In the fields of biology and medicine there is an increasing need for effective and 
user friendly data analysis software. The microarray technique which enables a 
quantitative description of the transcriptome of cells, generates a large amount of 
data that must be processed and analysed. To accomplish this, powerful methods 
from statistics, artificial intelligence and chemometrics must be employed. The 
field of microarray data analysis is rapidly advancing and there is a need for easy 
access to the best and latest methods. However, this is unfortunately hampered 
by several factors. For instance, commercial packages are often very expensive 
and lack the flexibility to include the latest data analysis methods. When using 
proprietary software it is difficult for scientists in the area to rapidly share newly 
developed methods or inspect interesting algorithms and their implementations. 
Powerful open source alternatives do exist, such as Bioconductor [1], which 
is written in R [2-4]; however, the threshold for using Bioconductor is rather 
high for non-statisticians and non-programmers, which makes it cumbersome to 
use. In addition, computer languages such as R, Matlab/Octave, C/C++, 
Java and Python have their own communities of developers and contributors, 
and it is often difficult for them to share source code. A more integrative data 
analysis system is therefore needed which provides a meeting point for both 
users and method developers. To address these challenges we have started an 
open source software project called SciCraft¹ to create a system which is flexible 
and powerful enough to satisfy the current and future requirements for data 
analysis of microarrays. 

2 SciCraft System Overview 

2.1 Guiding Principles 

The design and implementation of SciCraft is guided by a set of principles and 
ideas which we believe are important for the type of data analysis software we 
have envisioned. Some of the keywords that identify these guiding principles are: 

— Accessibility 

The user should have easy and rapid access to a wide range of different data 
analytical methods. 

— Integration 

To solve complex problems it is often necessary to integrate data analytical
methods from different software packages and computer languages. Manual
integration often imposes a significant extra workload, which makes the
research less efficient. Typical integration problems relate to, e.g., file
format conversions, interoperability of programs, operating system
compatibility and memory limitations. The aim is to create seamless integration of
the different methods such that the user does not need to know whether,
e.g., a FORTRAN, a C++ or an Octave [5] program is employed to solve a
certain problem.

— Expandability 

Many advanced users and method developers want to contribute their
algorithms and methods to a given data analysis system; however, they often find
this difficult when the system does not use their chosen computer language.
Thus the data analysis software should not demand that contributors, who
may have spent years developing a certain machine learning or statistical
method in, e.g., Lisp or Matlab, convert their software to another
language in order to make it accessible to a larger group of people.

— Open source 

There are several advantages to using an open source license. One important
reason is that it enables users to share, inspect and modify the source code
without having to get permission from the original author(s) or violating
proprietary licenses. Other important reasons for choosing open source are
related to stability [6], price and continuity/availability.

1 SciCraft was originally referred to as “Zherlock”, but the name was changed for
legal reasons.



2.2 Design Ideas 

To achieve the goal of integrating different technologies, SciCraft works as a
front-end to several numerical “engines” that actually perform the data
handling and processing. To achieve expandability and accessibility, the “engines”
selected are high level languages such as Matlab, Octave, Mathematica or
R. Stand-alone programs written in, e.g., C/C++ or Java can also be used;
however, high level languages make the implementation of new data analytical
methods more efficient.

SciCraft is released as open source under the GNU General Public License (GPL) [7,
8]. For this reason it is our policy to include in SciCraft only high level
languages which are compatible with the GPL, such as Octave (a Matlab clone),
R (an S-PLUS clone) or Python.

Thus, most computations are performed by sending requests to numerical 
“engines” that run programs written in high level languages. To enable smooth 
interaction with these languages, SciCraft employs various plug-ins that handle 
all the data and command communication with the chosen routines. 

Another important aspect of SciCraft is the use of an intuitive graphical
interface based on a visual programming environment (VPE) [9-12]. Computer
programs are here represented as diagrams which consist of nodes and
links (connection lines). Each node represents a method or an operator and each
link shows the flow of data. A link is displayed as an arrow to indicate the direction
of the data flow, see Fig. 1. The VPE is a natural choice for data analysis
purposes, as tasks can often be regarded as a flow of data through different filters or
operators.
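
The node-and-link model described above can be illustrated with a small dataflow evaluator. This is only a sketch under stated assumptions: the names `Node`, `link` and `run_diagram` are hypothetical and are not part of SciCraft; the sketch merely shows how a diagram of methods connected by directed links can be evaluated in data-flow order.

```python
from collections import deque


class Node:
    """A method or operator in a module diagram (hypothetical name)."""

    def __init__(self, name, func):
        self.name = name
        self.func = func      # callable applied to the upstream results
        self.inputs = []      # upstream nodes, in argument order


def link(src, dst):
    """Draw a link: data flows from src to dst."""
    dst.inputs.append(src)


def run_diagram(nodes):
    """Evaluate every node once, in topological (data-flow) order."""
    indegree = {n: len(n.inputs) for n in nodes}
    downstream = {n: [] for n in nodes}
    for n in nodes:
        for src in n.inputs:
            downstream[src].append(n)
    ready = deque(n for n in nodes if indegree[n] == 0)
    results = {}
    while ready:
        n = ready.popleft()
        results[n.name] = n.func(*(results[s.name] for s in n.inputs))
        for d in downstream[n]:
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    if len(results) != len(nodes):
        raise ValueError("diagram contains a cycle")
    return results


# Example pipeline: input -> mean-centring -> sum of squares
source = Node("input", lambda: [1.0, 2.0, 3.0])
centre = Node("centre", lambda x: [v - sum(x) / len(x) for v in x])
energy = Node("energy", lambda x: sum(v * v for v in x))
link(source, centre)
link(centre, energy)
out = run_diagram([source, centre, energy])
```

Evaluating nodes in topological order guarantees that each method runs only after all of its incoming links carry data, which is the behaviour a user would expect from a diagram of filters.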

SciCraft is designed in a modular fashion where the main parts are made 
as independent of each other as possible, see Fig. 2. The top graphical user 
interface (GUI) layer interprets the commands from the VPE and handles 2D 
and 3D plotting. The middle layer interprets the syntax of diagrams (called 
module diagrams) and sends the requests for outside “engines” to the plug-in 
layer. Currently plug-ins for the languages Octave, R and Python are supported. 



2.3 Technology Used 

For the programming of the main system we have chosen the following technolo- 
gies: 

— Python is selected as the main language. Python is an interpreted,
interactive, object-oriented open source scripting language invented by Guido van
Rossum [13, 14]. It is easy to learn, portable across platforms and well suited
for integration with other computer languages such as C/C++ and Java.




Fig. 1. This shows the VPE in SciCraft. Programs or data analysis pipelines are drawn 
as diagrams where each node represents a method or an operator and the links indicate 
the flow of data between the nodes. At the right hand side is the “node tree” where 
the user selects what nodes are to be used in the module diagram. The nodes are 
arranged according to classes of data analysis methods and for SciCraft system nodes. 
The structure of the node tree can be specified by the user. 



— Qt [15, 16] is selected as the main GUI library. This is an open source library 
created by the Norwegian company Trolltech AS [17] and forms the basis for 
the KDE [18] desktop manager system in Linux. Qt also runs on Windows. 

— The Visualisation Toolkit (VTK) [19] is selected to handle the 3-D 
graphics. VTK is an open source library created by the company Kitware[20] 
and contains a large number of high level scientific visualisation tools. 

— Qwt [21] is selected as the main library for producing 2D plots. SciCraft 
also uses the PyQwt package for Python bindings to Qwt [22]. 

Some of the desired properties which influenced the choice of these technolo- 
gies were platform independence, language integration, open source and perfor- 
mance quality. 

3 Analysis of Microarrays 

3.1 Pipelines and Diagrams 

In SciCraft the diagrams used to represent computer programs are referred to
as module diagrams. Often the data analyst wants to combine different
methods in a processing pipeline, and the visual programming environment is well
suited for this purpose. It is intuitive and more flexible than the ordinary GUI
approach found in most commercial statistical software packages (such as
Unscrambler from CAMO² and SPSS from SPSS Inc.³). For the analysis of
microarrays there are many possible processing pipelines and it is desirable to have
the most common ones easily available to the users of SciCraft. Fig. 3 illustrates
a hypothetical pipeline which can be directly represented using the VPE.

Fig. 2. The modular structure of the SciCraft software system.

3.2 Sources of Methods 

In principle all available Octave (Matlab), R and Python programs (given local 
restrictions) may be used as methods in SciCraft. Of particular interest here 
are open source and non-commercial toolboxes. In relation to Octave and Mat- 
lab functions for preprocessing and normalisation of microarrays some of the 
available toolboxes that may be included in SciCraft are: 

— MArray [23]

— MGraph [24]

— MAANOVA [25]

— MatArray [26, 27]

It should be noted that not all of these toolboxes follow the GPL; however, this
is not strictly necessary due to the nature of how these numerical “engines” are
called.

2 CAMO Process AS, Nedre Vollgt. 8, N-0158 Oslo, Norway 

3 SPSS Inc. Headquarters, 233 S. Wacker Drive, 11th floor Chicago, Illinois 60606, 
USA 



Data Analysis of Microarrays Using SciCraft 







Fig. 3. Illustration of possible data analysis pipelines for analysis of microarrays. Note 
that such pipelines can be directly used in the visual programming environment of 
SciCraft. HCA= hierarchical cluster analysis, PCA= principal component analysis, 
diff. gen. exp. = a non-specified method for finding differentially expressed genes. 



However, open source is to be preferred as it is often necessary to perform 
minor changes to existing Matlab programs to make them Octave compatible. It 
should also be kept in mind that some of the toolboxes require commercial Mat- 
lab toolboxes for running. For more general data analysis and machine learning 
the following toolboxes may be used: 

— Netlab[28] 

— Pattern Classification [29, 30] 

— SOM Toolbox[31] 

— MATLAB Support Vector Machine Toolbox[32] 

For R there are also many packages available related to analysis of microar- 
rays and general data analysis [33]. Perhaps the most comprehensive system is 
the Bioconductor (www.bioconductor.org) which contains powerful methods 
for all steps in the microarray analysis pipeline. 

For Python some packages that may be of interest are: 

— SciPy[34] 

— Biopython [35] 

— PyCluster[36, 37] 

4 Example 

The following analysis is included to demonstrate the functionality of SciCraft
when using selected chemometric data analysis methods on a microarray data
set. The data set chosen is described in [38] and was downloaded from [39]. It
consists of expression patterns of different cell types, where 40 objects are
colon tumour samples and 22 are normal colon tissue samples. The samples were
analysed with an Affymetrix oligonucleotide array complementary to more than
6,500 human genes. After preprocessing, normalisation and removal of outliers
the total number of genes used was 2000.

In this article the data are subjected to the following chemometric methods: 

— Principal component analysis (PCA) [40]

— Partial least squares regression (PLSR) [41]

Both PCA and PLSR have been successfully used in the analysis of microar- 
ray data previously [42-53]. Here PLSR is used as a classification method and 
is sometimes referred to as Discriminant PLSR (DPLSR)[54, 55]. 

The analysis performed consists of the following steps: 

— Read data from a Matlab file (both expression data and class values).

— Perform a PCA on the expression data only.

— Plot the sample scores for the first two principal components.

— Perform a PLSR analysis on the expression data where the dependent matrix
Y contains the label information.

— Plot the sample scores for the first two PLSR components.
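
The numerical core of the steps listed above can be sketched in plain NumPy. This is an illustration only, not the code run inside SciCraft: the function names are hypothetical, the data are a synthetic stand-in for an expression matrix (not the colon data set), PLSR is implemented with the standard NIPALS algorithm for a single response, and a 0.5 threshold on the predicted class value is an assumption.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """Scores and explained-variance ratios from an SVD of the centred data."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = s ** 2 / np.sum(s ** 2)
    return scores, explained[:n_components]


def plsr_fit(X, y, n_components):
    """PLS regression (NIPALS, single response) of class values y on X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f
        w /= np.linalg.norm(w)          # weight vector
        t = E @ w                       # sample scores
        tt = t @ t
        p = E.T @ t / tt                # X loadings
        c = f @ t / tt                  # y loading
        E = E - np.outer(t, p)          # deflate X
        f = f - c * t                   # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, beta


def plsr_predict(model, X):
    x_mean, y_mean, beta = model
    return (X - x_mean) @ beta + y_mean


def loo_errors(X, y, n_components):
    """Full (leave-one-out) cross validation: number of misclassified samples."""
    errors = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        model = plsr_fit(X[keep], y[keep], n_components)
        pred = plsr_predict(model, X[i:i + 1])[0]
        errors += int((pred > 0.5) != bool(y[i]))
    return errors


# Synthetic stand-in: 62 samples x 200 "genes", 10 of which differ between classes
rng = np.random.default_rng(0)
y = np.r_[np.ones(40), np.zeros(22)]
X = rng.standard_normal((62, 200))
X[y == 1, :10] += 2.0
scores, explained = pca_scores(X)
errors = loo_errors(X, y, n_components=2)
```

The two columns of `scores` correspond to the PCA scores plot, and `errors` counts the samples misclassified under full cross validation.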

The current VPE setup for the data analysis is shown in Fig. 4. 




Fig. 4. Screenshot of the SciCraft module diagram discussed in the example
(Colon data). Input = node for reading files (Matlab files in the current diagram), xval =
node for performing PLSR cross validation and returning the optimal model parameters,
Plot2D = node for creating and combining many types of 2D plots (in this case scatter
plots of scores matrices), pca = node for principal component analysis.



Two principal components were extracted which accounted for 36% and 12%
of the total variance. The PLSR analysis was validated by full cross validation
and produced four optimal components. The cross validated error was 8% (5 out
of 62 samples wrongly classified).

The scores plots for the PCA and PLSR analyses are shown in Fig. 5. In the
PCA scores plot (left) the two different classes are only partly separated. In
the PLSR analysis we have rotated the variable (gene) axes to have maximum
covariance with the class membership values (0 = normal and 1 = cancer). As can
be seen, the separation between the classes is markedly better with PLSR than
with PCA.

Fig. 5. SciCraft print-out showing the scores values for the samples from the PCA and
PLSR analyses (“0” = normal, “1” = cancer). It is clearly seen that the samples are
better separated in the scores plot after a PLSR analysis than after a PCA.



5 Discussion 

The main advantage of SciCraft is its ability to integrate and combine a wide
range of different methods in a seamless manner. The visual programming
environment allows the user more flexible and intuitive interaction with the
data analysis pipeline than most other data analysis software packages.
However, there is a price to pay for adopting the design presented in SciCraft. The
first disadvantage is the dependency on a large number of third-party software
packages, such as Octave, R, PyQwt, RPy and so on. This makes installing
the program, and fixing bugs that depend on third-party programs, more
difficult. We try to solve the installation problem by building easy-to-use scripts
and providing the necessary support to the users.

The other disadvantage of the chosen approach is speed. One serious
bottleneck in the system is the transport of data to and from the different nodes.
The approach currently used in SciCraft does not create a problem for
small to medium sized data sets, but will be less efficient for larger microarrays.
Fortunately, there are several ways to handle this problem and they are currently
being investigated in the project. One approach is to improve the
communication between plug-ins and the numerical “engines” and avoid dumping large
temporary files to disk. Another is to optimise how the requests from the
module diagrams to the plug-in layer are interpreted.

The current version of SciCraft is 0.9 and can be downloaded for Linux (Debian)
and Windows platforms from www.scicraft.org.
Abstract. A new method for constructing gene networks from microarray
time-series gene expression data is proposed in the context of the Bayesian
network approach. An essential point of Bayesian network modeling is
the construction of the conditional distribution of each random variable.
When estimating the conditional distributions from gene expression data,
a common problem is that gene expression data contain multiple missing
values. Unfortunately, many methods for constructing conditional
distributions require complete gene expression values and may lose
effectiveness even with a few missing values. Additionally, they treat microarray
time-series gene expression data as static data, although time can be an
important factor that affects the gene expression levels.

We overcome these difficulties by using the method of functional data 
analysis. The proposed network construction method consists of two 
stages. Firstly, discrete microarray time-series gene expression values are 
expressed as a continuous curve of time. To account for the time de- 
pendency of gene expression measurements and the noisy nature of the 
microarray data, P-spline nonlinear regression models are utilized. Af- 
ter this preprocessing step, the conditional distribution of each random 
variable is constructed based on functional linear regression models. The 
effectiveness of the proposed method is investigated through Monte Carlo 
simulations and the analysis of Saccharomyces cerevisiae gene expression 
data. 

Keywords: Bayesian networks, functional data analysis, P-spline, 
smoothing, time-series gene expression data 



1 Introduction 

With advances in DNA microarray technology, it has become possible to
understand complicated biological systems on a genome-wide scale. While a large
amount of gene expression data has been collected, the statistical
methods required to analyze such data are still in development. In particular, estimating
gene regulatory networks from gene expression data has become an important
topic in bioinformatics ([1], [2], [3], [6], [7], [10], [13], [16], [21]). The purpose
of this paper is to propose a new method for constructing gene networks from
microarray time-series gene expression data in the context of the Bayesian network
approach.

In Bayesian networks, a gene is regarded as a random variable and the
relationship between a target gene and its parent genes is represented by a
conditional distribution. Several methods have been proposed for constructing a
conditional distribution, such as the multinomial model ([6]), the linear regression model
([7]), the nonparametric additive regression model ([13]), the radial basis function
network regression model ([2]) and so on. Although these methods partly succeed
in constructing gene networks, it is still difficult to capture complicated
biological systems with a limited number of microarrays.

When estimating gene networks from microarray time-series or static gene
expression data, a common problem is that gene expression data contain
multiple missing expression values. Due to the missing values, microarray time-series
gene expression data for individual genes are often measured at different sets of
time points and have different numbers of gene expression values. Unfortunately,
previous methods for constructing conditional distributions are not robust to
missing values and require complete gene expression values. Ideally, this
problem could be solved by repeating the microarray experiments, but this is
impractical because of the cost incurred in producing microarray gene
expression data repeatedly. Another problem is that the previous methods cannot
take account of time information. Microarray time-series gene expression
data are measured to investigate dynamic biological systems and time can be an
important factor that affects the gene expression levels. Thus, a method which
can treat missing values and preserve the time dependency of gene expression
values is needed for analyzing microarray time-series gene expression data.

We overcome these difficulties by using the method of functional data analysis
([17], [18]). The proposed network construction algorithm consists of the
following two stages. Firstly, as a preprocessing step, a set of discrete gene expression
values is expressed as a continuous curve. To account for the time dependency
of the gene expression measurements and the noisy nature of the
microarray data, P-spline nonlinear regression models ([4]) are utilized. The P-spline
nonlinear regression modeling approach is an attractive method for modeling
a nonlinear smooth effect of time while excluding the noise. After this
preprocessing step, the conditional distribution of each random variable is constructed
by using functional linear regression models and a set of microarray time-course
gene expression curves. Even if microarray time-series gene expression data for
individual genes are measured at different sets of time points and have
different numbers of gene expression values due to missing values, the proposed method
easily treats such incomplete data by constructing microarray time-course gene
expression curves.

To investigate the effectiveness of the proposed method, we first compare the
performances of functional linear regression models and ordinary linear regression
models. We then apply the proposed method to analyze Saccharomyces cerevisiae
gene expression data as a real application. We show that the proposed method
estimates a more accurate gene network than linear regression models do in
both experiments.

2 Construction of Continuous Gene Expression Profiles

Let $i = 1, \ldots, p$ denote the individual gene index and $j = 1, \ldots, n$ denote the
$j$th time-course gene expression measurement. Then $z_{ij}(t)$ is the $j$th
time-course gene expression value for gene $i$ measured at time $t$. The purpose of this
section is to construct continuous microarray time-course gene expression curves
$x_{ij}(t)$ based on a set of discrete gene expression values
$\{z_{ij}(t_{ij1}), \ldots, z_{ij}(t_{ijn_{ij}})\}$ measured at $n_{ij}$ discrete time points
$t_{ij1}, \ldots, t_{ijn_{ij}}$.

In this section, we first review the basic concept of the B-spline function. We
then describe P-spline nonlinear regression models to construct microarray
time-course gene expression curves by considering the time dependency of the gene
expression measurements and the noisy nature of the microarray data.



2.1 Review of B-Spline Basis 

A B-spline function $Bs(x)$ is defined as a linear combination of $m$ B-spline basis
functions that consist of polynomial pieces connected at points called knots:

$$Bs(x) = \sum_{j=1}^{m} \gamma_j b_j(x; p),$$

where $b_j(x; p)$ are known B-spline basis functions of degree $p$ with $m+p+1$ knots
$t_1 < \cdots < t_{m+p+1}$ and $\gamma_j$ are unknown parameters. Each B-spline basis
function can be calculated using de Boor's recursion formula (de Boor (1978)):

$$b_j(x; 0) = \begin{cases} 1, & t_j \le x < t_{j+1}, \\ 0, & \text{otherwise}, \end{cases}$$

$$b_j(x; p) = \frac{x - t_j}{t_{j+p} - t_j}\, b_j(x; p-1) + \frac{t_{j+p+1} - x}{t_{j+p+1} - t_{j+1}}\, b_{j+1}(x; p-1),$$

where $p$ is the degree of the B-spline basis. Since a zero-degree B-spline
basis function is just a constant on one interval between two knots, it is simple to compute
a B-spline basis of any degree. We use a B-spline basis of degree 3, and denote
$b_j(x; 3)$ by $b_j(x)$ for simplicity of presentation.
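
De Boor's recursion above translates almost line by line into code. The following is a minimal sketch, assuming equally spaced knots on an interval `[lo, hi]` and 0-based indexing of the basis functions; the function names are illustrative and do not come from the paper.

```python
import numpy as np


def bspline_basis(j, p, knots, x):
    """Value at x of the j-th B-spline basis function of degree p (0-based j)."""
    if p == 0:
        return 1.0 if knots[j] <= x < knots[j + 1] else 0.0
    left = right = 0.0
    d1 = knots[j + p] - knots[j]
    if d1 > 0:
        left = (x - knots[j]) / d1 * bspline_basis(j, p - 1, knots, x)
    d2 = knots[j + p + 1] - knots[j + 1]
    if d2 > 0:
        right = (knots[j + p + 1] - x) / d2 * bspline_basis(j + 1, p - 1, knots, x)
    return left + right


def basis_matrix(x, m, p=3, lo=0.0, hi=1.0):
    """Matrix whose (a, j) entry is b_j(x_a; p), using m + p + 1 equally spaced knots."""
    h = (hi - lo) / (m - p)
    knots = lo + h * np.arange(-p, m + 1)
    return np.array([[bspline_basis(j, p, knots, xi) for j in range(m)] for xi in x])


# Cubic basis with m = 8 functions evaluated at 11 points of [0, 1]
B = basis_matrix(np.linspace(0.0, 1.0, 11), m=8)
```

On the interval covered by the interior knots the basis functions are non-negative and sum to one at every point, which gives a quick sanity check of an implementation.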



2.2 Continuous Representation 

Generally, microarray time-series gene expression data for individual genes may
be measured at different sets of time points and have different numbers of gene
expression values due to missing values. Suppose that we have microarray
time-series gene expression data $\{z_{ij}(t_{ij1}), \ldots, z_{ij}(t_{ijn_{ij}})\}$ for the $i$th gene in
the $j$th experiment, measured at discrete time points $t_{ij1}, \ldots, t_{ijn_{ij}}$. Considering
that microarray gene expression profiles contain noise, it is natural to consider
the observed time-series gene expression data as a set of samples taken from
an underlying continuous smooth process corrupted with noise.

We assume that the observed microarray time-series gene expression data are
expressed as a B-spline curve with additional noise

$$z_{ij}(t_{ij\alpha}) = x_{ij}(t_{ij\alpha}) + \varepsilon_{ij\alpha}, \quad \alpha = 1, \ldots, n_{ij},$$

where $x_{ij}(t) = \sum_{k=1}^{m} \gamma_{ijk} b_k(t) = \boldsymbol{\gamma}_{ij}' \boldsymbol{b}(t)$ is the true underlying gene
expression of the $i$th gene in the $j$th experiment and $\varepsilon_{ij\alpha}$ are experimental noise. Here
$\boldsymbol{\gamma}_{ij} = (\gamma_{ij1}, \ldots, \gamma_{ijm})'$ is the $m$-dimensional unknown parameter vector and
$\boldsymbol{b}(t) = (b_1(t), \ldots, b_m(t))'$ is the known B-spline basis vector of degree 3.

The squared residual estimate of $\boldsymbol{\gamma}_{ij}$ is obtained by minimizing the squared
residual function. In the model estimation process, however, we require not only a good fit to
the microarray time-series gene expression data but also a fitted
curve that is smooth enough to capture the true gene expression process. We therefore
estimate the unknown parameter $\boldsymbol{\gamma}_{ij}$ by minimizing the penalized squared residual
function

$$\ell(\boldsymbol{\gamma}_{ij}; \lambda_{ij}) = \sum_{\alpha=1}^{n_{ij}} \{ z_{ij}(t_{ij\alpha}) - x_{ij}(t_{ij\alpha}) \}^2 + \lambda_{ij} \int \{ x_{ij}''(t) \}^2 \, dt,$$

where $\lambda_{ij} > 0$ is the smoothing parameter, which controls the trade-off between the fit to the data
and the model complexity. It is known that the expression $\int \{x_{ij}''(t)\}^2 dt$ can be
approximated by

$$\int \{ Bs_{ij}''(t) \}^2 \, dt \approx \sum_{k=3}^{m} (\Delta^2 \gamma_{ijk})^2 = \boldsymbol{\gamma}_{ij}' D_2' D_2 \boldsymbol{\gamma}_{ij},$$

where $\Delta$ is the difference operator such that $\Delta \gamma_{ijk} = \gamma_{ijk} - \gamma_{ij,k-1}$ and $D_2$ is an
$(m-2) \times m$ matrix that represents the second-order difference operator ([4]). The use of
difference penalties has been investigated by many researchers ([9], [15], [24]).

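By minimizing the penalized squared residual function $\ell(\boldsymbol{\gamma}_{ij}; \lambda_{ij})$, the
penalized squared residual estimate is explicitly given by

$$\hat{\boldsymbol{\gamma}}_{ij} = (B_{ij}' B_{ij} + \lambda_{ij} D_2' D_2)^{-1} B_{ij}' \boldsymbol{z}_{ij},$$

where $\boldsymbol{z}_{ij} = (z_{ij}(t_{ij1}), \ldots, z_{ij}(t_{ijn_{ij}}))'$ is the $n_{ij}$-dimensional vector of observations and
$B_{ij} = (\boldsymbol{b}(t_{ij1}), \ldots, \boldsymbol{b}(t_{ijn_{ij}}))'$ is the $n_{ij} \times m$ matrix.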

The fitted curve depends on the number of basis functions $m$ and the value
of the smoothing parameter $\lambda_{ij}$. In practice, Eilers and Marx ([4]) used
a moderately large number of basis functions $m$ to ensure enough flexibility, and then
optimized the value of the smoothing parameter $\lambda_{ij}$, which controls the roughness penalty,
to guarantee sufficient smoothness of the fitted curves. Thus we fix the number
of basis functions $m$ and optimize the value of the smoothing parameter by using
cross-validation ([22]). Specifically, the optimal value of the smoothing parameter is
found by minimizing the cross-validated residual sum of squares

$$CV_{ij} = \sum_{\alpha=1}^{n_{ij}} \left\{ z_{ij}(t_{ij\alpha}) - \hat{\boldsymbol{\gamma}}_{ij}^{(-\alpha)\prime} \boldsymbol{b}(t_{ij\alpha}) \right\}^2,$$

where $\hat{\boldsymbol{\gamma}}_{ij}^{(-\alpha)}$ denotes the penalized squared residual estimate based on the
observed sample after first deleting the $\alpha$th observation.
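
The closed-form estimate and the cross-validation criterion above can be sketched together in a few lines of NumPy. The sketch makes several assumptions that are not in the paper: equally spaced knots on `[0, 1]`, a small grid of candidate smoothing parameters, and a synthetic profile with two missing time points. The function names are illustrative.

```python
import numpy as np


def bspline_design(t, m, p=3, lo=0.0, hi=1.0):
    """n x m matrix with rows b(t_a)'; equally spaced knots are an assumption."""
    h = (hi - lo) / (m - p)
    knots = lo + h * np.arange(-p, m + 1)                 # m + p + 1 knots
    t = np.asarray(t, dtype=float)[:, None]
    B = ((knots[:-1] <= t) & (t < knots[1:])).astype(float)   # degree 0
    for d in range(1, p + 1):                             # de Boor's recursion
        left = (t - knots[:-d - 1]) / (knots[d:-1] - knots[:-d - 1]) * B[:, :-1]
        right = (knots[d + 1:] - t) / (knots[d + 1:] - knots[1:-d]) * B[:, 1:]
        B = left + right
    return B


def pspline_fit(B, z, lam):
    """gamma_hat = (B'B + lam * D2'D2)^{-1} B'z with a second-order difference penalty."""
    D2 = np.diff(np.eye(B.shape[1]), n=2, axis=0)         # (m-2) x m
    return np.linalg.solve(B.T @ B + lam * D2.T @ D2, B.T @ z)


def loocv(B, z, lam):
    """Cross-validated residual sum of squares, refitting without each observation."""
    idx = np.arange(len(z))
    return sum((z[a] - B[a] @ pspline_fit(B[idx != a], z[idx != a], lam)) ** 2
               for a in idx)


# Synthetic profile with two missing time points (illustrative only)
rng = np.random.default_rng(0)
t = np.delete(np.linspace(0.0, 1.0, 18), [4, 11])
z = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(t.size)
B = bspline_design(t, m=8)

grid = 10.0 ** np.arange(-4, 3)                           # candidate lambda values
cv_scores = [loocv(B, z, lam) for lam in grid]
best_lam = grid[int(np.argmin(cv_scores))]
gamma_hat = pspline_fit(B, z, best_lam)                   # curve: x(t) = gamma_hat' b(t)
```

Because the curve is defined by `gamma_hat` for every `t`, profiles observed at different sets of time points can all be evaluated on a common grid afterwards. The brute-force refitting in `loocv` is chosen for clarity rather than speed.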
In this section a set of sparse microarray time-series gene expression data
for the $i$th gene in the $j$th experiment, $\{z_{ij}(t_{ij1}), \ldots, z_{ij}(t_{ijn_{ij}})\}$, is represented
as a continuous curve $x_{ij}(t) = \hat{\boldsymbol{\gamma}}_{ij}' \boldsymbol{b}(t)$. This preprocessing step is done for all
combinations $(i, j)$, $i = 1, \ldots, p$, $j = 1, \ldots, n$. In the next section we construct
the gene regulatory network by using a set of microarray time-series gene expression
curves.

3 Gene Regulatory Network Estimation 
via Functional Data Analytic Approach 

3.1 Review of Bayesian Network 

Suppose that we are interested in modeling a gene regulatory network $G$ that
consists of a set of $p$ genes $\boldsymbol{x} = (x_1, \ldots, x_p)'$. In the context of the Bayesian
network, the genes are assumed to be random variables and a directed
acyclic graph $G$ encoding the Markov assumption is considered. The joint density
function is then decomposed into the conditional density of each variable

$$f(\boldsymbol{x}) = \prod_{i=1}^{p} f_i(x_i \mid \boldsymbol{p}_i), \tag{1}$$

where $\boldsymbol{p}_i = (p_{i1}, \ldots, p_{iq_i})'$, $i = 1, \ldots, p$, are the $q_i$-dimensional parent genes of
gene $x_i$. Through formula (1), the focus of interest in Bayesian networks is how
to construct the conditional densities $f_i(x_i \mid \boldsymbol{p}_i)$, $i = 1, \ldots, p$.
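
As a small illustration of the factorization in (1), the joint log-density of a network is the sum of the conditional log-densities of each gene given its parents. The sketch below assumes linear-Gaussian conditionals purely for concreteness; the gene names, weights and variances are made up and the functional models used in this paper are introduced later.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log-density of N(mean, var) at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def log_joint(x, parents, weights, var):
    """log f(x) = sum_i log f_i(x_i | p_i), with linear-Gaussian conditionals (assumed)."""
    total = 0.0
    for gene, pa in parents.items():
        mean = sum(weights[gene][k] * x[k] for k in pa)
        total += gaussian_logpdf(x[gene], mean, var[gene])
    return total


# Hypothetical three-gene network: g1 -> g2, g1 -> g3, g2 -> g3
parents = {"g1": [], "g2": ["g1"], "g3": ["g1", "g2"]}
weights = {"g1": {}, "g2": {"g1": 0.8}, "g3": {"g1": 1.1, "g2": -0.9}}
var = {"g1": 1.0, "g2": 0.5, "g3": 0.25}
x = {"g1": 0.3, "g2": 0.1, "g3": 0.2}
value = log_joint(x, parents, weights, var)
```

Because the joint density decomposes gene by gene, a change to the parent set of one gene only affects that gene's term, which is what makes local scoring of candidate networks possible.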

In this section, the joint density function (1) with the conditional densities
$f_i(x_i \mid \boldsymbol{p}_i)$, $i = 1, \ldots, p$, is estimated using $n$ sets of time-course gene expression
curves. Under a fixed graph structure, i.e., when the set of parent genes for each gene is
fixed, we first describe how to construct the conditional density $f_i(x_i \mid \boldsymbol{p}_i)$ based
on microarray time-series gene expression curves. However, the problem that still
remains to be solved is how to choose the optimal graph structure, which gives
the best approximation of the system underlying the data. Thus we then consider
the model selection problem in the following section.

3.2 Functional Linear Regression Models 

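For a target gene $x_i$ and its parent genes $\boldsymbol{p}_i = (p_{i1}, \ldots, p_{iq_i})'$, suppose we have a
set of gene expression curves $\{(x_{ij}(t), \boldsymbol{p}_{ij}(t));\ j = 1, \ldots, n\}$ with $x_{ij}(t) = \hat{\boldsymbol{\gamma}}_{ij}' \boldsymbol{b}(t)$
and $p_{ikj}(t) = \hat{\boldsymbol{\psi}}_{ikj}' \boldsymbol{b}(t)$, $k = 1, \ldots, q_i$. Here $\hat{\boldsymbol{\gamma}}_{ij}$ and $\hat{\boldsymbol{\psi}}_{ikj}$ are the penalized squared
residual estimates obtained in the previous section, and we denote them by $\boldsymbol{\gamma}_{ij}$ and $\boldsymbol{\psi}_{ikj}$
for simplicity of presentation.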

In the functional linear regression modeling approach ([17], [18]), the
relationship between a target gene and its parent genes is characterized by the following
equation:

$$x_{ij}(t) = \sum_{k=1}^{q_i} \int \beta_{ik}(s, t)\, p_{ikj}(s)\, ds + \varepsilon_{ij}(t), \quad i = 1, \ldots, p, \tag{2}$$

where $\beta_{ik}(s, t)$ are the bivariate regression coefficient functions that must be
estimated from the data and $\varepsilon_{ij}(t)$ is a residual function that is independently
and normally distributed with mean 0 and variance $\sigma_i^2$.

Usually, the bivariate regression coefficient functions $\beta_{ik}(s, t)$ are modeled
nonparametrically using a basis function approach. In particular, we express
$\beta_{ik}(s, t)$ as a double expansion ([18]):

$$\beta_{ik}(s, t) = \sum_{l_1=1}^{l_{ik}} \sum_{l_2=1}^{m} w_{ikl_1l_2}\, c_{ikl_1}(s)\, b_{l_2}(t) = \boldsymbol{c}_{ik}(s)' W_{ik} \boldsymbol{b}(t), \tag{3}$$

where $W_{ik}$ is the $l_{ik} \times m$ matrix of parameters and $\boldsymbol{c}_{ik}(s) = (c_{ik1}(s), \ldots, c_{ikl_{ik}}(s))'$
is the known $l_{ik}$-dimensional B-spline basis vector. Substituting (3) into (2), the
functional linear regression model (2) can be expressed as a statistical model
from a class of probability densities

$$f_i(x_{ij}(t) \mid \boldsymbol{p}_{ij}(t); \boldsymbol{\theta}_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left[ -\frac{\left\{ x_{ij}(t) - \sum_{k=1}^{q_i} \boldsymbol{\psi}_{ikj}' R_{ik} W_{ik} \boldsymbol{b}(t) \right\}^2}{2\sigma_i^2} \right], \tag{4}$$

where $R_{ik} = \int \boldsymbol{b}(s)\, \boldsymbol{c}_{ik}(s)'\, ds$ is an $m \times l_{ik}$ matrix and $\boldsymbol{\theta}_i$ is the set of unknown
parameters, i.e., $W_{ik}$ and $\sigma_i^2$. If the $i$th gene has no parents in the graph,
we assume the Gaussian model with constant mean $\mu_i(t) = \mu_i$ and variance $\sigma_i^2$.
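
The substitution of (3) into (2) can be spelled out in one line. Using the basis representation $p_{ikj}(s) = \boldsymbol{\psi}_{ikj}' \boldsymbol{b}(s)$ of the parent curves and the definition of $R_{ik}$, and noting that both $p_{ikj}(s)$ and $\beta_{ik}(s, t)$ are scalars, each integral in (2) reduces to a product of known matrices and the unknown $W_{ik}$:

$$\int \beta_{ik}(s, t)\, p_{ikj}(s)\, ds = \int \boldsymbol{\psi}_{ikj}' \boldsymbol{b}(s)\, \boldsymbol{c}_{ik}(s)' W_{ik} \boldsymbol{b}(t)\, ds = \boldsymbol{\psi}_{ikj}' \left\{ \int \boldsymbol{b}(s)\, \boldsymbol{c}_{ik}(s)'\, ds \right\} W_{ik} \boldsymbol{b}(t) = \boldsymbol{\psi}_{ikj}' R_{ik} W_{ik} \boldsymbol{b}(t).$$

Hence model (2) becomes $x_{ij}(t) = \sum_{k=1}^{q_i} \boldsymbol{\psi}_{ikj}' R_{ik} W_{ik} \boldsymbol{b}(t) + \varepsilon_{ij}(t)$, which is linear in the parameter matrices $W_{ik}$ and gives the mean appearing in the exponent of (4).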



3.3 Model Parameter Estimation 



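The integrated squared residual estimate of $W_{ik}$ is obtained by minimizing the
following quantity:

$$\sum_{j=1}^{n} \int \left\{ \boldsymbol{\gamma}_{ij}' \boldsymbol{b}(t) - \sum_{k=1}^{q_i} \boldsymbol{\psi}_{ikj}' R_{ik} W_{ik} \boldsymbol{b}(t) \right\}^2 dt.$$

In practice, however, the integrated squared residual estimate does not yield
satisfactory results because the estimated bivariate regression coefficient
functions $\beta_{ik}(s, t) = \boldsymbol{c}_{ik}(s)' W_{ik} \boldsymbol{b}(t)$ tend to be undersmoothed and lead to overfitting.
To avoid this, a penalty term on the smoothness of the unknown regression
coefficients is introduced into the integrated squared residual. Specifically, we
minimize

$$\ell(\boldsymbol{\theta}_i; \lambda_{i1}, \ldots, \lambda_{iq_i}) = \sum_{j=1}^{n} \int \left\{ \boldsymbol{\gamma}_{ij}' \boldsymbol{b}(t) - \sum_{k=1}^{q_i} \boldsymbol{\psi}_{ikj}' R_{ik} W_{ik} \boldsymbol{b}(t) \right\}^2 dt + \sum_{k=1}^{q_i} \lambda_{ik} \iint \{ \beta_{ik}(s, t) \}^2 \, ds \, dt,$$

where $\lambda_{ik}$ is the smoothing parameter that controls the smoothness of $\beta_{ik}(s, t)$.

Given the values of the smoothing parameters $\lambda_{i1}, \ldots, \lambda_{iq_i}$ and the numbers of
basis functions $l_{i1}, \ldots, l_{iq_i}$ (which specify the dimensions of the parameter matrices
$W_{ik}$, $k = 1, \ldots, q_i$), the penalized integrated squared residual estimate of
$W_i = (W_{i1}', \ldots, W_{iq_i}')'$ can be obtained as the solution of
$\partial \ell(\boldsymbol{\theta}_i; \lambda_{i1}, \ldots, \lambda_{iq_i}) / \partial W_i = O$.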
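Now let us define the $n \times m$ matrix $\Psi_{ik} = (\boldsymbol{\psi}_{ik1}, \ldots, \boldsymbol{\psi}_{ikn})'$, and use it to define the
$n \times (\sum_{k=1}^{q_i} l_{ik})$ matrix $X_i = (\Psi_{i1} R_{i1}, \ldots, \Psi_{iq_i} R_{iq_i})$. Then the penalized integrated
squared residual estimate is explicitly given by

$$\hat{W}_i = (X_i' X_i + P_i)^{-1} X_i' \Gamma_i,$$

where $\Gamma_i = (\boldsymbol{\gamma}_{i1}, \ldots, \boldsymbol{\gamma}_{in})'$ and $P_i = \mathrm{diag}\{\lambda_{i1} \Omega_{i1}, \ldots, \lambda_{iq_i} \Omega_{iq_i}\}$ is the
$\sum_{k=1}^{q_i} l_{ik}$ dimensional block diagonal matrix with $\Omega_{ik} = \int \boldsymbol{c}_{ik}(s)\, \boldsymbol{c}_{ik}(s)'\, ds$. Then the
unknown variance is estimated by

$$\hat{\sigma}_i^2 = \frac{1}{n} \sum_{j=1}^{n} \int \left\{ \boldsymbol{\gamma}_{ij}' \boldsymbol{b}(t) - \sum_{k=1}^{q_i} \boldsymbol{\psi}_{ikj}' R_{ik} \hat{W}_{ik} \boldsymbol{b}(t) \right\}^2 dt.$$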
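
The closed-form estimate above has the same structure as a ridge-type regression with a block-diagonal penalty, which the following NumPy sketch makes explicit. The matrices used here are random stand-ins rather than quantities computed from spline bases or expression data, and the function name is illustrative.

```python
import numpy as np


def penalized_estimate(Psi, R, Omega, lam, Gamma):
    """W_hat = (X'X + P)^{-1} X'Gamma with X = (Psi_1 R_1, ..., Psi_q R_q)."""
    X = np.hstack([Psi_k @ R_k for Psi_k, R_k in zip(Psi, R)])
    total = X.shape[1]
    P = np.zeros((total, total))
    start = 0
    for lam_k, Omega_k in zip(lam, Omega):          # block diagonal penalty matrix
        size = Omega_k.shape[0]
        P[start:start + size, start:start + size] = lam_k * Omega_k
        start += size
    W_hat = np.linalg.solve(X.T @ X + P, X.T @ Gamma)
    return W_hat, X, P


# Random stand-ins: n curves, m basis functions, two parent genes
rng = np.random.default_rng(1)
n, m = 15, 6
sizes = [4, 5]                                      # l_i1 and l_i2
Psi = [rng.standard_normal((n, m)) for _ in sizes]
R = [rng.standard_normal((m, l)) for l in sizes]
Omega = []
for l in sizes:
    A = rng.standard_normal((l, l))
    Omega.append(A @ A.T + np.eye(l))               # symmetric positive definite
Gamma = rng.standard_normal((n, m))
W_hat, X, P = penalized_estimate(Psi, R, Omega, [0.5, 2.0], Gamma)
```

The rows of `W_hat` stack the blocks for the individual parent genes, so the block for the first parent is `W_hat[:4]` and the block for the second is `W_hat[4:]` in this example.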



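Replacing the unknown parameters by their sample estimates $\hat{W}_{ik}$ and $\hat{\sigma}_i^2$
yields the conditional distribution

$$f_i(x_{ij}(t) \mid \boldsymbol{p}_{ij}(t); \hat{\boldsymbol{\theta}}_i) = \frac{1}{\sqrt{2\pi\hat{\sigma}_i^2}} \exp\left[ -\frac{\left\{ \boldsymbol{\gamma}_{ij}' \boldsymbol{b}(t) - \sum_{k=1}^{q_i} \boldsymbol{\psi}_{ikj}' R_{ik} \hat{W}_{ik} \boldsymbol{b}(t) \right\}^2}{2\hat{\sigma}_i^2} \right]. \tag{5}$$

Note that the estimated model (5) depends on the values of the smoothing
parameters $\lambda_{i1}, \ldots, \lambda_{iq_i}$ and the numbers of basis functions $l_{i1}, \ldots, l_{iq_i}$. The
appropriate values of these parameters are chosen by using the cross validation criterion
given in the next section.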



3.4 Criterion for Selecting Network 

The Bayesian network with functional linear regression models introduced in the
previous section can be estimated when we fix the network structure. However,
the optimal network structure is generally unknown and has to be estimated. In
addition, we have to choose the values of the smoothing parameters $\lambda_{i1}, \ldots, \lambda_{iq_i}$ and
the numbers of basis functions $l_{i1}, \ldots, l_{iq_i}$.

To complete our scheme, we need some criterion to evaluate the goodness of a network, i.e., its
closeness to the true gene network. In this paper, we use the cross validation
score defined by

$$CV = \sum_{i=1}^{p} CV_i = \sum_{i=1}^{p} \sum_{j=1}^{n} \int \left\{ \boldsymbol{\gamma}_{ij}' \boldsymbol{b}(t) - \sum_{k=1}^{q_i} \boldsymbol{\psi}_{ikj}' R_{ik} \hat{W}_{ik}^{(-j)} \boldsymbol{b}(t) \right\}^2 dt, \tag{6}$$

where $\hat{W}_{ik}^{(-j)}$ denotes the penalized integrated squared residual estimate based
on the observed sample after first deleting the $j$th observation. The optimal
graph is chosen such that the CV score is minimal.

The local score

$$CV_i = \sum_{j=1}^{n} \int \left\{ \boldsymbol{\gamma}_{ij}' \boldsymbol{b}(t) - \sum_{k=1}^{q_i} \boldsymbol{\psi}_{ikj}' R_{ik} \hat{W}_{ik}^{(-j)} \boldsymbol{b}(t) \right\}^2 dt$$

evaluates the goodness of the conditional distribution $f_i(x_{ij}(t) \mid \boldsymbol{p}_{ij}(t); \boldsymbol{\theta}_i)$.
However, note that the construction of a gene network is to find a set of parent genes
for each gene, such that the estimated network is acyclic.

By using the Bayesian network and functional linear regression models
together with the CV criterion, the optimal network is in principle obtained by searching the full
model space. The optimization of the network structure is equivalent to choosing
the parent genes that regulate each target gene, and considering all possible gene
combinations as parent genes is a time-consuming task. We therefore reduce
the search space by selecting candidate parent genes ([6]). In detail, the
candidate set of parent genes of a target gene $x_i$ is chosen such that it gives small
$CV_i$ scores. Starting from the empty network, a greedy algorithm is employed
for finding better networks.
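
One way to organise such a search is sketched below. The paper does not specify the moves of its greedy algorithm, so this sketch assumes the simplest variant: starting from the empty network, it repeatedly adds the single edge that most reduces the total score while keeping the graph acyclic. The function names, the `max_parents` limit and the toy score are all hypothetical; in practice `local_score` would return $CV_i$ for the given parent set.

```python
def creates_cycle(parents, child, new_parent):
    """True if adding the edge new_parent -> child would close a directed cycle."""
    stack, seen = [new_parent], set()
    while stack:
        node = stack.pop()
        if node == child:
            return True               # child is an ancestor of new_parent
        if node in seen:
            continue
        seen.add(node)
        stack.extend(parents[node])
    return False


def greedy_search(genes, candidates, local_score, max_parents=2):
    """Add one edge at a time, always taking the largest score reduction."""
    parents = {g: [] for g in genes}
    scores = {g: local_score(g, ()) for g in genes}
    while True:
        best = None
        for child in genes:
            if len(parents[child]) >= max_parents:
                continue
            for cand in candidates[child]:
                if cand in parents[child] or creates_cycle(parents, child, cand):
                    continue
                new = local_score(child, tuple(parents[child] + [cand]))
                gain = scores[child] - new
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, child, cand, new)
        if best is None:              # no edge improves the score any further
            return parents, sum(scores.values())
        _, child, cand, new = best
        parents[child].append(cand)
        scores[child] = new


# Toy score: number of parents that differ from a made-up "true" structure
true_parents = {"a": set(), "b": {"a"}, "c": {"a", "b"}}
genes = ["a", "b", "c"]
candidates = {g: [h for h in genes if h != g] for g in genes}
toy_score = lambda gene, pa: len(true_parents[gene] ^ set(pa))
network, total = greedy_search(genes, candidates, toy_score)
```

Restricting `candidates` to a small set per gene is what keeps the number of score evaluations manageable, since each evaluation of the real criterion involves refitting a penalized model.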

4 Computational Experiment 

To evaluate the proposed method, we first conducted Monte Carlo simulations.
We then applied the proposed method to Saccharomyces cerevisiae microarray
time-series gene expression data as a real application.

4.1 Monte Carlo Simulations 

Monte Carlo simulations were conducted to evaluate the effectiveness of the
proposed method. We compare the performance of functional linear regression
models and ordinary linear regression models. Since linear regression models
require complete gene expression values, we used two treatments of missing
values: an exclusion approach and a missing value imputation approach. In the
exclusion approach, a column of the microarray gene expression time-series data
(i.e., the measurements at one time point) is excluded even if it contains only
a few missing values. The imputation approach, on the other hand, estimates the
missing values in the gene microarray data. Troyanskaya ([23]) reported that
the weighted K-nearest neighbors method provides a more robust and sensitive
method for missing value estimation than both the singular value decomposition
based method and the row average method. Thus we utilized the weighted
K-nearest neighbors with the Euclidean norm for estimating the missing values
and set K = 10 in these experiments ([23]).
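
The imputation approach can be illustrated with a short sketch. The code below is an assumption-laden illustration of weighted K-nearest-neighbor imputation, not the implementation of [23]: it assumes inverse-distance weights, a root-mean-square Euclidean distance over the time points two genes share, and NaN as the missing-value marker. The function name `knn_impute` is hypothetical.

```python
import math


def knn_impute(data, k=10):
    """Return a copy of a genes x time-points table with NaN entries filled in.

    Each missing value is replaced by a weighted average of the values that the
    k most similar genes take at the same time point (weights are assumed to be
    inverse distances).
    """
    filled = [row[:] for row in data]
    for g, row in enumerate(data):
        for col, value in enumerate(row):
            if not math.isnan(value):
                continue
            neighbors = []  # (distance to gene g, neighbor's value at col)
            for other, other_row in enumerate(data):
                if other == g or math.isnan(other_row[col]):
                    continue
                # Compare the two genes only where both were observed.
                shared = [(a, b) for a, b in zip(row, other_row)
                          if not math.isnan(a) and not math.isnan(b)]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                neighbors.append((dist, other_row[col]))
            if not neighbors:
                continue  # nothing to borrow from; leave the entry missing
            neighbors.sort(key=lambda pair: pair[0])
            nearest = neighbors[:k]
            weights = [1.0 / (dist + 1e-8) for dist, _ in nearest]
            filled[g][col] = (sum(w * v for w, (_, v) in zip(weights, nearest))
                              / sum(weights))
    return filled
```

Distances are always computed from the original table, so an imputed value never influences another imputation in the same pass.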

A set of n = 15 artificial microarray time-series gene expression data was
generated from the artificial gene regulatory network shown in Figure 1 (a),
with the following functional structures between the genes:

$$
\begin{aligned}
X_1(t) &= f_1(t) + \varepsilon_1(t), & \varepsilon_1(t) &\sim N(0, s_1), \\
X_2(t) &= f_2(X_1(t)) + \varepsilon_2(t), & \varepsilon_2(t) &\sim N(0, s_2), \\
X_3(t) &= 1.1 X_1(t) - 0.9 X_2(t) + \varepsilon_3(t), & \varepsilon_3(t) &\sim N(0, s_3), \\
X_4(t) &= -1.2 X_2(t) + \varepsilon_4(t), & \varepsilon_4(t) &\sim N(0, s_4), \\
X_5(t) &= 0.7 X_1(t) + 0.05 \cos(X_1(t)) + \varepsilon_5(t), & \varepsilon_5(t) &\sim N(0, s_5), \\
X_6(t) &= \begin{cases}
  1 + \varepsilon_6(t) & (0.1 \le X_3(t) \text{ and } 0.1 \le X_5(t)), \\
  -1 + \varepsilon_6(t) & (X_3(t) \le -0.1 \text{ and } X_5(t) \le -0.1), \\
  \varepsilon_6(t) & \text{otherwise},
\end{cases} & \varepsilon_6(t) &\sim N(0, s_6),
\end{aligned}
$$
$$
\begin{aligned}
X_7(t) &= 0.6 X_6^2(t) - 0.7 + \varepsilon_7(t), & \varepsilon_7(t) &\sim N(0, s_7), \\
X_8(t) &= -0.3 X_1(t) + \varepsilon_8(t), & \varepsilon_8(t) &\sim N(0, s_8), \\
X_9(t) &= 0.6 \sin(X_7(t)) - 0.1 + \varepsilon_9(t), & \varepsilon_9(t) &\sim N(0, s_9), \\
X_{10}(t) &= \frac{1}{1 + \exp(-1.2 X_8(t) - 0.04 X_9(t))} + \varepsilon_{10}(t), & \varepsilon_{10}(t) &\sim N(0, s_{10}).
\end{aligned}
$$



To generate various microarray time-series gene expression values, we consider
the following expression curves $f_1(t)$:

  Case 1:   $f_1(t) = t$
  Case 2:   $f_1(t) = t^2$
  Case 3:   $f_1(t) = \sin(t)$
  Case 4:   $f_1(t) = 2t$
  Case 5:   $f_1(t) = 1 - t$
  Case 6:   $f_1(t) = \cos(t)$
  Case 7:   $f_1(t) = -t$
  Case 8:   $f_1(t) = t^3$
  Case 9:   $f_1(t) = t^2$
  Case 10:  $f_1(t) = t + 2$
  Case 11:  $f_1(t) = t^2 - 1$
  Case 12:  $f_1(t) = t$
  Case 13:  $f_1(t) = -t$
  Case 14:  $f_1(t) = t^2 - 1$
  Case 15:  $f_1(t) = -2 + 0.01t$

Each case is paired with a function $f_2(s)$. The functions $f_2(s)$ are, in the
order printed:

  $1 - s$, $\sin^{-1}(s)^3$, $s$, $-s + 2$, $\cos^{-1}(s)^3 + 2$, $-s$, $s^3/2$,
  $(s - 1)^3$, $1.5s - 1$, $\sin(\sqrt{s})$, $\cos(s)$, $-\sin(2\pi s)$,
  $2\sqrt{s - 1}$, $1 - s$.



The length of the time series for each experiment was randomly chosen from
{7, ..., 15}, and the amount of noise was randomly set to a signal-to-noise
ratio between 0.2 and 0.4. We consider two settings of the missing probability
for each experiment: P = 0 and P = 0.1. In our experience, observations
generated with the missing probability P = 0.1 are similar to real microarray
data. Although the setting P = 0 generates data with no missing values, it is
important as a reference for comparing the results obtained with P = 0.1 as
well as the performance of each model.

Figure 1 shows a typical result of the Monte Carlo simulations under the
missing probability P = 0.1. Figures 1 (b) and (c) show the optimal networks
estimated by a Bayesian network model using functional linear regression models
and linear regression models, respectively. Although there is no connection
between $X_6$ and $X_7$ in Figure 1 (b), the proposed method otherwise captures
the true network successfully. On the other hand, as shown in Figure 1 (c),
linear regression models fail to capture the connections $X_6 \to X_7$ and
$X_8 \to X_{10}$ and the direction of the edge between $X_9$ and $X_{10}$.
Because of the nonlinear structures between a target gene and its parent genes,
it is difficult to detect complicated systems with linear regression models.

Fig. 1. (a) Target pathway, (b) result of the Bayesian network and functional
linear regression models, (c) result of the Bayesian network and linear
regression models.

Table 1 compares the results of the Monte Carlo simulations under the two
settings of the missing probability, P = 0 and P = 0.1. The simulation results
were obtained from 1000 repeated Monte Carlo trials. The number listed for each
edge is the number of trials in which the connection was estimated, ignoring
direction, and the percentage is the share of those estimates with the correct
direction. For example, under the missing probability P = 0, functional linear
regression models estimated the relationship $X_1 \to X_3$ or $X_3 \to X_1$
853 times in 1000 Monte Carlo simulations, and 79 percent of those 853
estimates had the correct direction.
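
The two numbers reported per edge can be computed as in the following sketch. It assumes that each trial yields a set of directed edges written as `(parent, child)` pairs; the function name `tally_edge` and this data layout are hypothetical and only serve to make the counting rule concrete.

```python
def tally_edge(true_edge, estimated_networks):
    """Count detections of an edge regardless of direction, and the share of
    those detections that have the correct direction (in percent)."""
    parent, child = true_edge
    detected = 0
    correct = 0
    for edges in estimated_networks:
        forward = (parent, child) in edges
        backward = (child, parent) in edges
        if forward or backward:
            detected += 1
            if forward:
                correct += 1
    percent = 100.0 * correct / detected if detected else 0.0
    return detected, percent
```

With 1000 estimated networks as input, the first return value corresponds to a count such as 853 and the second to a percentage such as 79.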

The numbers of detected true edges and the percentages of correctly directed
edges indicate that the proposed method performs very well. Table 1 also
indicates that the linear regression models fail to capture the nonlinear
connections. For example, the connection $X_8 \to X_{10}$ is nonlinear, and
linear regression models detected it only 548 times in 1000 Monte Carlo
simulations. On the other hand, functional linear regression models detected
the connection $X_8 \to X_{10}$ 895 times in 1000 Monte Carlo simulations and
captured the complicated effects of the parent genes.

Compared with the results for P = 0, the performance of linear regression
models under the missing probability P = 0.1 degraded considerably. Although
the performance of functional linear regression models also degraded, the
proposed method still captures the true network structure very well. This is
because the proposed method preserves the time dependency of microarray
time-series gene expression data and is robust to missing values. Thus we can
expect the proposed network estimation method to work effectively in real data
analysis.

Table 1. The results of Monte Carlo simulations for functional linear
regression models (FLRM), linear regression models (LRM), linear regression
models with exclusion approach (LRMEX), and linear regression models with
missing value imputation approach (LRMIMP), respectively.

             P = 0                   P = 0.1
Edge         FLRM       LRM          FLRM       LRMEX      LRMIMP
X1 -> X2     853 (79%)  831 (81%)    761 (77%)  742 (75%)  737 (76%)
X1 -> X3     937 (86%)  923 (84%)    891 (83%)  882 (82%)  883 (83%)
X1 -> X5     999 (91%)  994 (83%)    995 (88%)  983 (87%)  990 (88%)
X1 -> X8     998 (90%)  988 (92%)    981 (87%)  951 (87%)  975 (87%)
X2 -> X3     916 (84%)  916 (83%)    869 (84%)  855 (84%)  842 (83%)
X2 -> X4     998 (91%)  995 (86%)    992 (90%)  958 (88%)  967 (88%)
X3 -> X6     953 (88%)  922 (84%)    912 (83%)  801 (79%)  859 (81%)
X5 -> X6     904 (84%)  873 (79%)    884 (80%)  705 (76%)  773 (78%)
X6 -> X7     726 (78%)  539 (73%)    692 (74%)  456 (68%)  490 (70%)
X7 -> X9     988 (91%)  976 (83%)    878 (85%)  864 (82%)  866 (81%)
X8 -> X10    895 (83%)  548 (74%)    825 (81%)  438 (67%)  467 (71%)
X9 -> X10    997 (90%)  894 (84%)    931 (87%)  685 (78%)  764 (79%)

4.2 Real Data Analysis

In this section we show the effectiveness of our proposed method through the
analysis of cell cycle gene expression data ([8]). The dataset consists of 173
microarrays that measured the response of Saccharomyces cerevisiae to various
stress conditions, and we used n = 16 microarray time-series gene expression
data.

To evaluate the accuracy of the estimated gene networks, we chose 90 genes from
the KEGG pathway database of the Saccharomyces cerevisiae cell cycle ([14]) and
compared the performance of functional linear regression models and linear
regression models. Since linear regression models require complete gene
expression values, the missing values were estimated by the weighted K-nearest
neighbor approach ([23]). Although the value of K can be determined by some
criterion, we set K = 20 in this analysis. Troyanskaya ([23]) reported that
this value makes the weighted K-nearest neighbor approach insensitive to the
missing values.

Figure 2 (b) shows a part of the estimated gene network based on the pro- 
posed method. The edges in the dashed circles can be considered as the correct 
edges. As shown in Figure 2 (b), we find that the proposed method succeeded 
in finding some known connections. 

For instance, the proposed method correctly estimated the connection between
ESC5 (YJL076W) and DBF20 (YPR111W). In contrast, linear regression models could
not detect this connection. Figures 3 (a) and (b) plot the estimated continuous
curves of the microarray time-series gene expression data obtained from the
experiments of heat shock from 25 °C to 37 °C and nitrogen depletion,
respectively. We can see from Figures 3 (a) and (b) that the expression levels
of ESC5 and DBF20 are proportional to each other. Figure 3 (c) plots the
estimated coefficient function $\beta(s,t)$. The shape of the estimate of
$\beta(s,t)$ indicates that the coefficient applied to an input signal changes
periodically.

Figure 2 (b) indicates that SIC1 (YLR079W) regulates CDC15 (YAR019C).
Interestingly, a connection between these two genes has been reported ([5]).
Figures 4 (a) and (b) plot the estimated continuous curves of the microarray
time-series gene expression data obtained from the experiments of heat shock
from 25 °C to 37 °C and stationary phase, respectively. The expression levels
of SIC1 and CDC15 are inversely proportional to each other. Figure 2 (b) also
indicates that SIC1 (YLR079W) regulates CDH1 (YGL003C). A connection between
these two genes is also known ([20]). In fact, Figures 5 (a) and (b) indicate
that the estimated continuous curves of the microarray time-series gene
expression data obtained from the experiments of heat shock from 25 °C to 37 °C
and stationary phase, respectively, are very closely related. Although the KEGG
pathway does not include these relationships, we succeeded in finding them with
the proposed approach.

Fig. 2. Cell cycle pathway compiled in KEGG. (a) Target network, (b) result of
the proposed method, and (c) result of linear regression models.

Fig. 3. The estimated continuous curves of microarray time-series gene
expression data obtained from the experiments of (a) heat shock from 25 °C to
37 °C and (b) nitrogen depletion, respectively. The solid lines and the dashed
lines represent the estimated continuous curves of ESC5 and DBF20,
respectively. The estimated coefficient function $\beta(s,t)$ is plotted in
panel (c).

Fig. 4. The estimated continuous curves of microarray time-series gene
expression data obtained from the experiments of (a) heat shock from 25 °C to
37 °C and (b) stationary phase, respectively. The solid lines and the dashed
lines represent the estimated continuous curves of SCC1 and CDC15,
respectively. The estimated coefficient function $\beta(s,t)$ is also plotted
in panel (c).

Fig. 5. The estimated continuous curves of microarray time-series gene
expression data obtained from the experiments of (a) heat shock from 25 °C to
37 °C and (b) stationary phase, respectively. The solid lines and the dashed
lines represent the estimated continuous curves of SCC1 and CDH1, respectively.
The estimated coefficient function $\beta(s,t)$ is also plotted in panel (c).

5 Conclusion

In this paper we proposed a new statistical method for estimating a genetic
network from microarray time-series gene expression data by using a Bayesian
network and functional linear regression models. When estimating the
conditional distributions from gene expression data, a common problem was that
gene expression data contain multiple missing expression values. Additionally,
previous methods treat microarray time-series gene expression data as static
data, although time can be an important factor that affects the gene expression
levels.

The key idea of the proposed method is to express discrete microarray
time-series gene expression values as a continuous curve over time and then
estimate the conditional distribution of each random variable based on
functional linear regression models with microarray time-course gene expression
curves. To account for the time dependency of the gene expression measurements
and the noisy nature of the microarray data, P-spline nonlinear regression
models were utilized. Even if microarray time-series gene expression data for
individual genes are measured at different sets of time points and contain
different numbers of expression values because of missing values, the proposed
method easily treats such incomplete data by constructing microarray
time-course gene expression curves.

From the Monte Carlo simulations, we can conclude that the proposed method
estimates more accurate networks than a Bayesian network with linear regression
models. The reason is that the proposed method preserves the time dependency of
microarray time-series gene expression data and is robust to missing values. We
also investigated the effectiveness of the proposed method through the analysis
of Saccharomyces cerevisiae gene expression data and evaluated the resulting
network by comparing it with biological knowledge. The proposed method
successfully extracted effective information, which can be found visually in
the resulting genetic network. We used a simple greedy algorithm for learning
the network. However, this algorithm needs much time to determine the optimal
graph. Hence, the development of a better algorithm is one of the important
remaining problems, and we would like to discuss it in a future paper.
Abstract. Automated extraction of information from biological literature
promises to play an increasingly important role in text-based knowledge
discovery processes. This is particularly true with regard to high throughput
approaches
such as microarrays and combining data from different sources in a systems bi- 
ology approach. We have developed an integrated system that combines pro- 
tein/gene name dictionaries, synonymy dictionaries, natural language process- 
ing, and pattern matching rules to extract and organize gene relationships from 
full text articles. In the first phase full text articles were collected from 20 peer- 
reviewed journals in the field of molecular biology and biomedicine over the 
last 5 years (1999-2003). The extracted relationships were organized in a data- 
base that included the unique PubMed ID and section id (abstract, introduction, 
materials and method, and results and discussion) to identify the source article 
and section from which concepts were extracted. The system architecture, its 
uniqueness and advantages are presented in this paper. It is hoped that the re- 
sulting knowledge base will assist in the understanding of gene lists generated 
from microarray experiments. 



1 Introduction 

Recent advances in the areas of genomics and proteomics have become increasingly 
dependent on high throughput expression arrays. Analysis and data mining of these 
experiments yield lists of genes or proteins that may not have a readily apparent rela- 
tionship. The research literature is an obvious source to help uncover these relation- 
ships. On the other hand, the immense growth of research literature in the field of 
molecular biology and biomedicine requires efficient automated methods to capture 
and store the information. For example, the biological abstract database Medline [1] 
alone contains 13 million abstracts and is currently growing by 4% per year.
Both abstracts and full text articles are important sources of textual data;
abstracts are readily available but have limited content, whilst full text
provides richer content but is more difficult to obtain, organize and analyze.

Research in the field of Information Extraction (IE) has been focused on develop- 
ing techniques for extracting useful information from semi-structured text. The aim of 
IE [2] is to automatically identify relevant fragments of information in semi- 
structured text sources and to represent these fragments in a highly structured form 
that can be subsequently stored in a searchable database. 

Many approaches have been proposed for information extraction from biological 
text (both abstracts and full text). The identification of biomedical entities (e.g. pro- 
tein, gene, small molecule) is often the first level extraction task. Techniques used for 
this task include rule-based methods [3-5], dictionary-based methods [6-8], and ma- 
chine-learning methods [9-13]. Additionally some interesting mixed or hybrid meth- 
ods [14] have also been described. The rule-based systems are based on surface clues 
and lexical context of the documents. Dictionary-based approaches utilize defined 
terms to identify the occurrence of names in a text by means of various substring 
matching techniques. Machine learning approaches rely on the presence of an
expert-annotated training corpus to automatically derive the identification
rules by means of various statistical algorithms.

At the next level, complex IE tasks involving the extraction of relational
information between entities have been addressed, for example protein
interactions [15-20], gene annotations [21], relations between genes and drugs
[22], and identification of metabolic pathways [23]. The techniques vary from
pattern-matching rule libraries [15,16,21] to shallow parsing techniques
[17,18,19] and full sentence parsing [20]. Pattern-matching systems rely on
matching of pre-specified templates (patterns) or rules (such as
precedence/following rules of specific words). Shallow parsers perform partial
decomposition of a sentence structure to identify certain phrasal components
and extract local dependencies between them without reconstructing the
structure of the entire sentence. Full sentence parsers, on the other hand,
deal with the structure of entire sentences. The reported recall and precision
of these systems range from 60% to 95%. However, most of these systems handle
only a particular kind of interaction [19,21] or a small set of data
[15,16,17]. For example, Sekimizu and colleagues [21] extract relations
associated with seven frequently occurring verbs (activate, bind, interact,
regulate, encode, signal, and function) found in Medline abstracts. The
precision and recall of this system ranged from 67.8% to 83.3% depending on the
particular verb used. Ono et al. [15] have reported a precision of 94% and a
recall of 83% for text related to the yeast genome only. Pustejovsky et al.
[19] reported a precision of 90% and a recall of 57% using 500 hand-annotated
corpora for inhibition relations alone.

Other IE tasks involving the construction of a knowledge base or information
base of protein/gene information have also been addressed [24-26]. For example,
GeneWays [24] focuses on extracting molecular interactions pertinent to signal
transduction pathways, and Medstract [25] focuses on protein inhibition
relations from Medline abstracts. However, these systems were designed to
address a single well-defined area or task [24] or use abstracts only [25,26],
and lack a generalised knowledge base, which limits their application. The
development of high throughput platform technologies such as microarrays
[27,28], which record expression levels of thousands of genes simultaneously,
requires knowledge of a large number of genes, their interactions and
relationships with other biological entities. It is therefore necessary to
develop information extraction methods that are able to identify both a wide
variety of relationships and biological entities from full text articles.

In this paper, we describe the development of a large scale knowledge base using 
pattern-matching rule libraries to perform automated extraction of gene relationships 
from full text articles. This has been designed as a general information extraction 
system and thus it is not restricted to a small number of entity types and relationships 
that pertain to a particular application. A distinct advantage of this system is that users 
can extract gene relationships from full text articles for their own set of genes of in- 
terest. We also describe the application of our knowledge base to the understanding of 
relationships found in gene lists generated from microarray experiments. 



2 Overall System Architecture 



Our goal is to extract gene relationships from full text articles using natural
language processing and pattern matching rules, to label them with their PubMed
ID and section ID, and to store them in a relational database that can be
queried with gene lists generated from microarray experiments. The key steps,
system components, and dataflow are illustrated in Figure 1.



Fig. 1. Key components and processing steps



We built our system using custom-built text processing tools for document
preprocessing, LexiQuestMine [29], a commercially available natural language
processing (NLP) tool from SPSS (Chicago, IL), for natural language processing,
and Microsoft SQL Server for our relational database. The implementation of
each component is described in more detail in the Methods section.

3 Methods

3.1 Collection of Full Text Journal Articles 

We collected full text articles published in 20 peer-reviewed journals in the field of 
molecular biology and biomedicine over the last 5 years (1999-2003) (Table 1). The 
journals were selected in accordance with research interests in brain tumors.
Additional selection criteria included journal impact and representation of
different electronic publishers. A commercially available software package,
GetItRight [30], was
used with our scripts to automatically connect and download full text articles from 
Northwestern University Libraries and corresponding journal sites. All articles were 
downloaded as HTML text. To save storage space, figures were not included in the 
downloads. 



Table 1. List of downloaded journals and publishers' web sites

Journal Name                      URL
Biochemistry                      http://pubs.acs.org/journals/bichaw/
BBRC                              http://www.sciencedirect.com/science/journal/0006291X
Brain Research                    http://www.sciencedirect.com/science/journal/00068993
Cancer                            http://www3.interscience.wiley.com/cgi-in/jhome/28741
Cancer Research                   http://cancerres.aacrjournals.org
Cell                              http://www.cell.com/
EMBO Journal                      http://embojournal.npgjournals.com/
FEBS Letters                      http://www.sciencedirect.com/science/journal/00145793
Genes and Development             http://www.genesdev.org/
International Journal of Cancer   http://www3.interscience.wiley.com/cgi-in/jhome/29331
Journal of Biological Chemistry   http://www.jbc.org/
Journal of Cell Biology           http://www.jcb.org/
Journal of Neuroscience           http://www.jneurosci.org/
Nature                            http://www.nature.com/
Neuron                            http://www.neuron.org/
Neurology                         http://www.neurology.org/
Nucleic Acids Research            http://nar.oupjournals.org/
Oncogene                          http://www.nature.com/onc/
PNAS                              http://www.pnas.org/
Science                           http://www.sciencemag.org/



3.2 Processing of Journal Articles 

In preparation for text mining, downloaded articles were formatted to permit
efficient concept extraction. In contrast to simple abstracts, full text
articles require a range of formatting steps, such as the removal of HTML tags,
replacement of special characters like Greek symbols (e.g. α → alpha), and
expansion of abbreviations. Formatting was accomplished by writing a script for
each task.
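
As an illustration of this kind of formatting script, the sketch below removes HTML tags and spells out Greek letters. It is only a guess at what such a script could look like: the function name `clean_article` and the small `GREEK` table are hypothetical, the regular-expression tag removal is a simplification, and abbreviation expansion is not shown.

```python
import html
import re

# Illustrative subset of a Greek-letter table; a real table would be larger.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta", "κ": "kappa"}


def clean_article(raw_html):
    """Strip markup from an HTML article and spell out Greek symbols."""
    # Drop script and style elements together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", raw_html)
    # Replace every remaining tag by a space.
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Turn character references such as &alpha; into the characters themselves.
    text = html.unescape(text)
    for symbol, name in GREEK.items():
        text = text.replace(symbol, name)
    # Collapse the whitespace left behind by removed tags.
    return re.sub(r"\s+", " ", text).strip()
```

Unescaping entities before replacing symbols means that both the literal character α and the reference `&alpha;` end up as the word "alpha".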

Once the initial formatting was performed, the unique PubMed ID of each article
was inserted as an XML tag. This was done using a string matching script that
compares the title tag of each article with a separate file, created from
PubMed, that contains titles and PubMed IDs. XML section tags (abstract,
introduction, methods, and results and discussion) were also inserted at the
beginning and end of each section of the article so that each section could be
identified separately. After the insertion of XML tags, a check script ensured
the proper insertion of all four section tags (abstract, introduction, methods,
and results and discussion) in each article. The generic structure of the
full-text documents after preprocessing is shown in Table 2.

Table 2. A generic view of a full-text document after preprocessing 

<?xml version='1.0'?><Doc> 

<MedlineID>12514136</MedlineID> 

<Title> 

Determinants in mammalian telomerase RNA that mediate 
enzyme processivity and cross-species incompatibility 

</Title> 

<Abstract> 

Abstract of document here .... 

</Abstract> 

<Introduction> 

Introduction of document here .... 

</Introduction> 

<Methods> 

Materials and methods section of document here . . . 

</Methods> 

<Results> 

Results and discussion section of document here 

</Results></Doc>
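
A check of the kind described above could look like the following sketch. It assumes the tag names shown in Table 2 and uses Python's standard XML parser; the function name `missing_sections` is hypothetical and the original check script is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Section tags taken from the document layout in Table 2.
REQUIRED_SECTIONS = ("Abstract", "Introduction", "Methods", "Results")


def missing_sections(xml_text):
    """Return the required section tags that are absent from a document.

    ET.fromstring raises ET.ParseError if the markup is not well formed, for
    example when a closing tag is missing or damaged.
    """
    root = ET.fromstring(xml_text)
    return [name for name in REQUIRED_SECTIONS if root.find(name) is None]
```

An empty list means that all four section tags are present directly under the `Doc` element.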



3.3 Natural Language Processing 

Once full text formatting was completed, natural language processing (NLP) of the 
text was performed to extract gene annotations. NLP includes the following steps: 

• tokenize the text into sentences; 

• parse the sentences to identify noun phrases and verb phrases; 

• select sentences which may contain gene annotations using provided gene/protein 
name, relation and synonyms dictionaries; 

• extract gene annotations using pattern matching rules. 

The NLP tool LexiQuestMine [29] from SPSS used for this extraction has an op- 
tion for using custom-built dictionaries to identify user-desired concepts and patterns. 
In our case, it was augmented with gene/protein names, synonyms and patterns asso- 
ciated with gene annotations. 



3.4 Gene/Protein Name, Synonyms and Relation Dictionary Creation 

We used a dictionary-based approach to identify gene/protein names in the texts. We 
created two dictionaries, one describing protein and gene names and the other defin- 
ing synonyms, to identify the sentences containing one or more gene/protein names. 
The gene/protein dictionaries were compiled using various external database sources,
including LocusLink [31], Genecards [32], Swissprot [33], GoldenPath [34], and 
HUGO [35]. Similarly, two further dictionaries were created, one with interaction 
verbs and another defining their variations as synonyms (e.g. inhibit - inhibits, inhibi- 
tion, inhibited). Currently, the relation and relation synonyms dictionaries are created 
using semi-manual methods. We first extracted the verb phrases using a pilot corpus 
of 1000 articles and the variations were added manually by a domain expert. Further, 
we used two additional knowledge dictionaries, which contain some contextual clues 
such as prefix and suffix terms (e.g. kinase, phosphate, receptor) to improve the accu- 
racy. 
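
The dictionary-based step can be pictured with the following sketch, which maps synonyms to a preferred name and reports which genes occur in a sentence. The dictionary contents, the longest-match-first rule, and the function names `build_lookup` and `find_genes` are assumptions made for illustration; they are not taken from the system described here.

```python
import re


def build_lookup(synonyms):
    """Map every lower-cased name or synonym to its preferred name."""
    lookup = {}
    for preferred, names in synonyms.items():
        lookup[preferred.lower()] = preferred
        for name in names:
            lookup[name.lower()] = preferred
    return lookup


def find_genes(sentence, lookup):
    """Return the preferred names of dictionary entries found in a sentence."""
    found = []
    remaining = sentence.lower()
    # Try long names first so that a long synonym is not split by a short one.
    for name in sorted(lookup, key=len, reverse=True):
        pattern = r"(?<![a-z0-9])" + re.escape(name) + r"(?![a-z0-9])"
        if re.search(pattern, remaining):
            if lookup[name] not in found:
                found.append(lookup[name])
            # Blank out the match so shorter names cannot match inside it.
            remaining = re.sub(pattern, " ", remaining)
    return found
```

A sentence for which `find_genes` returns at least one name would then be passed on to the pattern matching rules.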



3.5 Recognition of Gene Relationships 

The sentences identified by name dictionaries containing one or more gene/protein 
names and relations were parsed by rich set of pattern matching rules to extract gene 
annotations from text. The rules were based on the arrangement of gene/protein 
names, prepositions and keywords that indicate the type of relationship between 
genes. We define patterns using nouns describing agents, passive verbs, active verbs 
and nouns describing actions. An example for each pattern is illustrated in Table 3. 



Table 3. An example set of word patterns

Type:      Nouns describing agents
Pattern:   ($gene (is)? (the|an|a) @{0,2} $action of @{0,2} $gene)
Sentence:  IL6, a known mediator of STAT3 response
Output:    Interleukin 6 STAT3 mediates

Type:      Passive verbs
Pattern:   ($gene @{0,6} (is|was|be|are|were) @{0,1} $action (by|via|through)
           @{0,3} $gene)
Sentence:  Protein kinase c (PKC) has been shown to be activated by parathyroid
           hormone
Output:    Parathyroid hormone pkc activates

Type:      Active verbs
Pattern:   ($gene $sub-action @{0,1} $action @{0,2} $gene)
Sentence:  Insulin mediated inhibition of hormone sensitivity lipase activity
Output:    insulin lipase inhibits

Type:      Nouns describing actions
Pattern:   ($gene @{0,6} $action (of|with) @{0,1} $gene)
Sentence:  abi5 domains required for interaction with abi3
Output:    abi5 abi3 interacts
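
To make the pattern notation concrete, the sketch below re-expresses the "nouns describing actions" pattern as a Python regular expression. This is an approximation under stated assumptions, not the LexiQuestMine rule syntax: `@{0,n}` is read as "up to n arbitrary tokens", and the gene list, the action table, and the function name `extract` are hypothetical.

```python
import re

# Tiny illustrative dictionaries; the real ones are far larger.
GENES = ["abi5", "abi3"]
ACTIONS = {"interaction": "interacts", "inhibition": "inhibits",
           "activation": "activates"}


def gap(max_tokens):
    """Regex fragment for up to max_tokens whitespace-separated tokens."""
    return r"(?:\S+\s+){0,%d}?" % max_tokens


GENE = "(?:" + "|".join(re.escape(g) for g in GENES) + ")"
ACTION = "(?:" + "|".join(re.escape(a) for a in ACTIONS) + ")"

# ($gene @{0,6} $action (of|with) @{0,1} $gene)
PATTERN = re.compile(
    rf"\b(?P<g1>{GENE})\s+{gap(6)}(?P<act>{ACTION})\s+(?:of|with)\s+"
    rf"{gap(1)}(?P<g2>{GENE})\b",
    re.IGNORECASE,
)


def extract(sentence):
    """Return (gene, gene, relation verb) for the first match, else None."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    verb = ACTIONS[match.group("act").lower()]
    return match.group("g1"), match.group("g2"), verb
```

The action noun is mapped back to its verb through the `ACTIONS` table, which plays the role of the relation synonyms dictionary in this sketch.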



3.6 Relational Database Design 

We developed a relational database in Microsoft SQL Server to facilitate the
integration of our extracted relations with Medline records. This allows the
user to search gene annotations using the set of diverse fields available in
Medline records. Our relational database comprises two tables: the extracted
relations with PubMed ID and section ID; and the key fields from Medline
records (PMID, title, authors, source, publication type, date of publication,
and MeSH keywords). For articles containing more than five MeSH terms, only the
first five terms were included. We defined an entity-relationship model [36]
for joining the two tables. The tables were integrated based on the common
attribute PubMed ID. The final output record for each extracted annotation,
with some key fields from Medline, is shown in Table 4.



Table 4. Sample output record associated with each gene annotation

PubMed ID         12807881
Gene annotation   cytochrome c activates caspase-3
Section ID        Results and Discussion
Date              2003 Aug 22
Title             The Ret finger protein induces apoptosis via its RING finger-B
                  box-coiled-coil motif.
MeSH              Amino Acid Motifs, Apoptosis/*physiology, Caspases/metabolism,
                  Enzyme Activation, Human, Mitogen-Activated Protein
                  Kinases/metabolism
Source            J Biol Chem 2003 Aug 22;278(34):31902-8.
Authors           Dho SH, Kwon KS
Publication type  Journal Article
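
The two-table design and the join on PubMed ID can be sketched as follows. The sketch uses SQLite from the Python standard library as a stand-in for Microsoft SQL Server, and the table names, column names, and the function `annotations_for` are hypothetical; only the sample record is taken from Table 4.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE medline (
    pmid     INTEGER PRIMARY KEY,
    title    TEXT, authors TEXT, source TEXT,
    pub_type TEXT, pub_date TEXT, mesh TEXT
);
CREATE TABLE relation (
    id      INTEGER PRIMARY KEY,
    pmid    INTEGER REFERENCES medline(pmid),
    section TEXT, agent TEXT, target TEXT, verb TEXT
);
""")
conn.execute(
    "INSERT INTO medline VALUES (?, ?, ?, ?, ?, ?, ?)",
    (12807881,
     "The Ret finger protein induces apoptosis via its RING finger-B "
     "box-coiled-coil motif.",
     "Dho SH, Kwon KS", "J Biol Chem 2003 Aug 22;278(34):31902-8.",
     "Journal Article", "2003 Aug 22", "Apoptosis/*physiology"))
conn.execute(
    "INSERT INTO relation (pmid, section, agent, target, verb) "
    "VALUES (?, ?, ?, ?, ?)",
    (12807881, "Results", "cytochrome c", "caspase-3", "activates"))


def annotations_for(connection, genes):
    """Return relations that mention any gene in the list, joined to Medline."""
    if not genes:
        return []
    marks = ",".join("?" * len(genes))
    sql = (
        "SELECT r.agent, r.verb, r.target, r.section, m.pmid, m.title "
        "FROM relation r JOIN medline m ON m.pmid = r.pmid "
        f"WHERE lower(r.agent) IN ({marks}) OR lower(r.target) IN ({marks}) "
        "ORDER BY m.pmid"
    )
    params = [gene.lower() for gene in genes] * 2
    return connection.execute(sql, params).fetchall()
```

Calling `annotations_for(conn, gene_list)` with a gene list from a microarray experiment returns one row per extracted relation together with the article it came from.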



4 Results and Discussion 

By using full text journals we have created a generalised information base thus in- 
creasing the range of applications for which it can be used and enabling users to ex- 
tract annotations for their own, user-defined set of genes. 

4.1 Creation of Gene Relationship Database 

Article Processing. The primary text of each article was identified by XML tags to 
facilitate text mining and identify where an extracted gene relationship is found. This 
also prevented text mining on sections such as literature cited, or references, which 
can confound the results. The variations used by different journals and publishers 
were taken into consideration. For example, methods, materials and methods, experi- 
mental procedures, patients and methods, systems and methods etc. were all tagged 
with the single term methods. Initially we tagged Results and Discussion as two separate
sections with separate tags; however, we found that in 25% of articles these two
sections were combined into a single section. Therefore, both elements were tagged as a
single section and labeled Results. Articles that did not conform to the four section
structure were found to be mainly review articles and were tagged with a single tag
Review at the beginning and end of the document. These articles were moved to a
separate folder for review articles. Further, the download agent downloaded all articles
from each journal, including editorials, letters to the editor, errata etc., which were not
necessary for the mining process. Articles without PubMed IDs usually fell under these
categories and were discarded from the collection.
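
The heading normalization described above can be sketched as a simple lookup from journal-specific headings to canonical section tags. This is only an illustration in Python, not the tagging tool actually used; the function name normalize_section and the dictionary layout are assumptions, and only the heading variants named in the text are listed.

```python
# Hypothetical mapping from journal-specific headings to canonical section tags.
SECTION_VARIANTS = {
    "abstract": "abstract",
    "introduction": "introduction",
    "methods": "methods",
    "materials and methods": "methods",
    "experimental procedures": "methods",
    "patients and methods": "methods",
    "systems and methods": "methods",
    "results": "results",
    "discussion": "results",
    "results and discussion": "results",
}

# Sections skipped so that cited titles do not confound the mining results.
EXCLUDED = {"references", "literature cited"}


def normalize_section(heading):
    """Return the canonical tag for a heading, or None if excluded or unknown."""
    key = " ".join(heading.lower().split())  # fold case and collapse whitespace
    if key in EXCLUDED:
        return None
    return SECTION_VARIANTS.get(key)
```

A heading that is not in the table returns None, which a caller could treat as a signal that the article does not follow the four section structure (for example, a review).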

Creation of Synonymy Dictionaries. A major obstacle in information extraction
from biomedical literature is the variety of names by which each gene or protein is known;
these names can also change over time. We created a synonymy dictionary with a preferred
name for each gene or protein. This dictionary was compiled using LocusLink
as the primary source, supplemented by other publicly available databases as described in
the methods section. The present gene dictionary consists of 282,882 unique
gene/protein names. The resulting synonyms dictionary consisted of an additional
274,845 synonym names. Similarly, for the relation dictionary, we first tried stemming
the verbs, but our test run results were not encouraging. We therefore added the
variations manually with the help of a domain expert. The current relation dictionary consists of
124 relation verbs and their variations as synonyms.
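
As a rough illustration of how such dictionaries can be applied, the sketch below maps each surface form to a single preferred name. It is a minimal Python example under stated assumptions: the entries are illustrative and are not taken from the authors' dictionaries, and preferred_name is a hypothetical helper.

```python
# Illustrative synonym dictionary: each known variant maps to one preferred name.
GENE_SYNONYMS = {
    "pkb": "akt1",
    "protein kinase b": "akt1",
    "akt1": "akt1",
    "p53": "tp53",
    "tp53": "tp53",
}

# Illustrative relation dictionary: verb variations listed by hand, not stemmed.
RELATION_SYNONYMS = {
    "activates": "activates",
    "activated": "activates",
    "activation": "activates",
    "inhibits": "inhibits",
    "inhibited": "inhibits",
    "inhibition": "inhibits",
}


def preferred_name(term, dictionary):
    """Look up the preferred form of a term; return None when it is not listed."""
    key = " ".join(term.lower().split())  # fold case and collapse whitespace
    return dictionary.get(key)
```

Listing the variations explicitly, as the text describes for the relation verbs, trades some manual effort for predictable behavior: nothing is matched unless a curator has added it.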

Gene Relation Database. In the gene relation database, the gene relations were la- 
beled with the PubMed ID and section ID using the pre-inserted XML tags. The final 
gene relation database structure is illustrated in Table 5. This will help users to get 
corresponding gene annotations from full text articles for their initial PubMed query 
results. 



Table 5. Sample output of gene relation database 



PubMed ID   Gene 1        Gene 2       Relation    Source
12881431    APOBEC2       AICDA        mediates    Abstract
12101418    NS5ATP13TP2   P53          Inhibits    Introduction
15131130    map3k7        nf-kappa b   activates   Methods
12154096    lRf-1         Pkb          activates   Results



Gene Relation Database to Relational Database. The final relational database
contains additional fields from Medline, such as MeSH terms, author name, journal name,
and publication type. These fields enable users to obtain the gene relations from full text
articles corresponding to the particular category they are interested in. For example, a
query that contains the MeSH key words human prognosis receptors [MeSH] will extract the gene
annotation records pertaining to those key words. Similarly, queries based
on other fields such as source, publication type, section ID, etc., or on combinations of
one or more fields, will retrieve gene annotations related to the particular query. Further,
we are planning to integrate our relational database with hand-curated gene
relation databases such as DIP [37] and KEGG [38], using the PubMed ID as the common key.
This should make our system a more complete knowledge base of gene relations.
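
The join on PubMed ID and a MeSH-based query can be sketched with an in-memory SQLite database. This is a minimal Python illustration, not the authors' schema: the table and column names are assumptions, MeSH terms are stored as one delimited string for brevity, and the single sample row is taken from Table 4.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE relations (pmid TEXT, gene1 TEXT, relation TEXT, gene2 TEXT, section TEXT);
CREATE TABLE medline (pmid TEXT PRIMARY KEY, title TEXT, authors TEXT, source TEXT,
                      pub_type TEXT, pub_date TEXT, mesh TEXT);
""")

# Sample record from Table 4 (MeSH qualifiers dropped, first five terms kept).
conn.execute(
    "INSERT INTO relations VALUES (?, ?, ?, ?, ?)",
    ("12807881", "cytochrome c", "activates", "caspase-3", "Results and Discussion"),
)
conn.execute(
    "INSERT INTO medline VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("12807881",
     "The Ret finger protein induces apoptosis via its RING finger-B box-coiled-coil motif.",
     "Dho SH, Kwon KS", "J Biol Chem", "Journal Article", "2003 Aug 22",
     "Amino Acid Motifs; Apoptosis; Caspases; Enzyme Activation; Human"),
)


def annotations_for_mesh(term):
    """Return the gene relations whose source article carries the given MeSH term."""
    rows = conn.execute(
        """
        SELECT r.pmid, r.gene1, r.relation, r.gene2, r.section
        FROM relations AS r
        JOIN medline AS m ON r.pmid = m.pmid
        WHERE m.mesh LIKE ?
        ORDER BY r.pmid
        """,
        (f"%{term}%",),
    )
    return rows.fetchall()
```

Filtering on other Medline fields (source, publication type, date) would follow the same pattern with a different WHERE clause.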

4.2 Evaluation 

In order to test the accuracy of our system, we developed a test corpus of 100 ran- 
domly selected articles, five from each journal and one for each year (1999-2003). 
The corpus was manually analyzed by 10 neuro-biologists within our laboratory and 
the results were compared with automatically extracted results. Analysis of the corpus 
revealed a total of 141 distinct references to gene annotations. The effectiveness of our
system is measured in terms of precision and recall [39], which are defined in equations
(1) and (2) below. Precision is a measure of the degree of correctness and recall
is a measure of the degree of completeness.

\[
\text{precision} = \frac{\text{number of relevant items retrieved}}{\text{total number of items retrieved}} = \frac{TP}{TP + FP} \tag{1}
\]

\[
\text{recall} = \frac{\text{number of relevant items retrieved}}{\text{total number of relevant items}} = \frac{TP}{TP + FN} \tag{2}
\]

Where TP, FN and FP are defined as: 

TP - Number of annotations that were correctly identified by the system 
FN - Number of annotations that the system failed to identify 

FP - Number of annotations identified by the system but not found in the corpus

Analysis of the output generated by the system yielded precision and recall rates of
63.5% and 37.3%, respectively. The precision and recall of our system are relatively
low when compared to other systems. The low recall is probably because the
chosen corpus is quite large, consisting of full text articles from 20 different journals
for the years 1999-2003. The low precision is due to current pattern errors
in identifying gene/protein names in some sentences. Further, our system does not
currently contain multi-word patterns to explore complex sentences. Direct comparison
of our work with the work of others is not possible, since the corpora used for
evaluation, the evaluation criteria, and the objectives differ.
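
The two measures can be computed directly from the set of automatically extracted annotations and the set of manually identified ones. The following Python sketch is illustrative only; the function name and the toy annotation sets in the usage note are assumptions and do not reproduce the evaluation corpus.

```python
def precision_recall(extracted, gold):
    """Compute (precision, recall) for extracted annotations against a gold set."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)   # annotations correctly identified by the system
    fp = len(extracted - gold)   # annotations reported but not present in the corpus
    fn = len(gold - extracted)   # annotations the system failed to identify
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For instance, with four hypothetical gold annotations and three extracted annotations of which two are correct, the function returns a precision of 2/3 and a recall of 1/2.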

4.3 Abstracts vs. Full Text Paper 

It is to be expected that the number of facts found in full text is much higher than in the
corresponding abstracts. As our system focuses on extraction of relations from full text
articles, we compared the number of patterns present in each section of an article.
Table 6 shows the total number of patterns extracted from each section of the article
for one year (2003) of the EMBO journal, which contains 605 full text articles. As
expected, the Results and Discussion section contains a large number of patterns, as
does the Abstract section. In addition, there appear to be at least twice as many patterns
extracted from the full text as from abstracts alone. This suggests that at least half of the
potential gene relationships are missed by mining only abstracts.



Table 6. Total number of patterns extracted from each section 



Section                   Total number of patterns
Abstract                  479
Introduction              306
Methods                   279
Results and Discussion    532
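
The ratio quoted in the text can be checked against the counts in Table 6. The short Python calculation below uses only the table values; the variable names are illustrative, and it makes the simplifying assumption that patterns in the body sections are counted independently of those in the abstract.

```python
# Pattern counts per section, copied from Table 6 (EMBO journal, 2003).
patterns = {
    "Abstract": 479,
    "Introduction": 306,
    "Methods": 279,
    "Results and Discussion": 532,
}

# Patterns found outside the abstract.
body_total = sum(n for section, n in patterns.items() if section != "Abstract")

# Ratio of body patterns to abstract patterns, and share of all patterns in the body.
ratio = body_total / patterns["Abstract"]
body_share = body_total / sum(patterns.values())
```

Here body_total is 1117, so the ratio is about 2.33 and roughly 70% of all extracted patterns lie outside the abstract, which is consistent with the "at least twice as many" statement.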



4.4 Example Application to Gene Expression Data 

As a pilot project, we examined the gene lists of an experiment designed to look at 
differential gene expression in a glioma cell line in response to two mitogenic stimuli 
(EGF and S1P). Affymetrix gene chip analysis was performed on triplicate samples
for each condition (resting, EGF, and S1P), and the average expression data were analyzed
as previously described [40]. The gene lists were grouped into genes differentially
expressed in response to both stimuli and lists unique to EGF and to S1P. Each gene
name was compared to the concepts extracted from full text articles representing the
past five years of six premier biomedical journals. This analysis resulted in a high
correlation between genes in the common list and the biological processes of cell
cycle and proliferation. This was to be expected since both EGF and S1P stimulate
cell proliferation. The gene list unique to S1P stimulation resulted in a high correlation
with the concepts of cell stress response and IL-6 signaling. This correlation was
not immediately obvious from gene ontologies or text analysis of abstracts. These
data suggest that full text analysis can be incorporated into the data mining pipeline of
microarray analysis and that it can be more helpful than gene ontologies and text analysis
of abstracts for understanding the underlying biology.

4.5 Current Status 

While the initial “prototype” has shown utility, it is recognized that on-going mainte- 
nance is required. The current sources of functional gene and protein data are often in 
the form of manually curated databases, and so, do not provide ready access to the 
information required to advance discovery. Is the insight really “new” if someone else 
has already processed the information? To address this concern, both a “raw extrac- 
tion” process and an “empty pattern” examination process are in place. This is to say 
that information about new terms and newly identified relationships between terms 
are published daily and it is necessary to discover both previously unrecognized con- 
cepts and newly identified relationships between concepts as they are published. To 
that end, we exercise the SPSS LexiQuest Mine NLP process, without forcing extrac- 
tion on predefined terms, to extract concepts that we simply do not know how to 
“query” for. New concepts are processed in an automated fashion, using a series of 
comparison and filtering steps, via the Clementine data mining workbench [41] and 
our custom dictionaries are updated with only the truly new information. Addition- 
ally, the pattern-matching engine is applied to the entire index of extracted concepts. 
Our process allows for the fast identification of these new functions and interactions 
and the rapid updating of the knowledge base. 
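
The filtering step that keeps only the truly new information can be pictured as a set difference between freshly extracted concepts and the existing dictionary. The Python sketch below is a simplified stand-in for the comparison and filtering steps mentioned above; it does not use or imitate the Clementine workbench, and new_concepts is a hypothetical helper.

```python
def new_concepts(extracted, dictionary):
    """Return extracted concepts not yet in the dictionary, in first-seen order."""
    known = {" ".join(term.lower().split()) for term in dictionary}
    seen = set()
    fresh = []
    for term in extracted:
        key = " ".join(term.lower().split())  # fold case and collapse whitespace
        if key not in known and key not in seen:
            seen.add(key)
            fresh.append(key)
    return fresh
```

Only the terms returned by such a filter would need review before being added to the custom dictionaries.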



5 Conclusion 

We have shown a fully automated way of constructing large-scale knowledge bases of 
gene relationships from full text articles using gene/protein name dictionaries, natural 
language processing and powerful pattern-matching rule libraries. As a unique feature 
each relationship was integrated with PubMed ID and section ID to identify the 
source article and section from which each relationship is extracted. This was devel- 
oped as part of our ongoing research to integrate pediatric brain tumor gene expres- 
sion data with additional data sources including genetic data, biomedical literature, 
and clinical information. The 20 initial journals in our list were selected in accordance
with this project. Further, we plan to compare abstract and full text mining
using this corpus.






References 

1. National Library of Medicine's bibliographic database at http://www.ncbi.nlm.nih.gov 

2. Cowie, J., & Lehnert, W.: Information extraction. Communications of the ACM, 39, 80- 
91, (1996) 

3. Fukuda, K., Tsunoda, T., Tamura, A., & Takagi, T.: Towards Information Extraction: 
identifying protein names from biological papers. Pacific Symposium on Biocomputing, 
707-718, (1998)

4. Eriksson, G., Franzen, K., Olsson, F.: Exploiting syntax when detecting protein names in 
text, Workshop on natural language processing in Biomedical Applications, (2002) 

5. Narayanaswamy, M., Ravikumar, K.E., Vijay-Shanker, K.: A Biological Named Entity
Recognizer, Pacific Symposium on Biocomputing, 8, 427-438, (2003) 

6. Krauthammer, M., Rzhetsky, A., Morozov, P., Friedman C.: Using blast for identifying 
gene and protein names in journal articles, Gene, 259, 245-252, (2000)

7. Hanisch, D., Fluck, J., Mevissen, H.T., Zimmer, R.: Playing Biology's Name Game:
Identifying protein names in scientific text, Pacific Symposium on Biocomputing, 8, 403-414,
(2003)

8. Egorov, S., Yuryev, A., & Daraselia, N.: A simple and practical dictionary based approach 
for identification of proteins in Medline abstracts, JAMIA, 11(3), 174-178, (2004) 

9. Hatzivassiloglou, V., Duboue, P.A., & Rzhetsky, A.: Disambiguating proteins, genes, and 
RNA in text: a machine learning approach. Proceedings of the 9th International Confer- 
ence on Intelligent Systems for Molecular Biology, 97-106, (2001) 

10. Wilbur, W. et al.: Analysis of biomedical text for biochemical names: A comparison of
three methods, AMIA symposium, 176-180, (1999) 

11. Collier, N., Nobata, C., & Tsujii, J.: Extraction of name of genes and gene products with a
Hidden Markov Model, COLING conference proceedings, 201-207, (2000) 

12. Kazama, J., Makino, T., Ohta, Y., & Tsujii, J.: Tuning Support Vector Machines for 
Biomedical Named Entity Recognition, Proceedings of the Natural Language Processing 
in the Biomedical Domain, Philadelphia, PA, USA, (2002) 

13. Chang, J.T., Schutze, H., & Altman R.B.: GAPSCORE: finding gene and protein names 
one word at a time. Bioinformatics, 20, 216-225, (2004) 

14. Tanabe, L., & Wilbur, J.: Tagging gene and protein names in biomedical text. Bioinfor- 
matics, 18, 1124-1132, (2002) 

15. Ono, T., Hishigaki, H., Tanigami, A., & Takagi, T.: Automated extraction of information 
on protein-protein interactions from the biological literature, Bioinformatics, 17, 155-161, 
(2001)

16. Wong, L.: A protein interaction extraction system, Pacific Symposium on Biocomputing, 
6, 520-531, (2001)

17. Humphreys, K., Demetriou, G., & Gaizauskas, R.: Two applications of information extrac- 
tion to biological science journal articles: enzyme interactions and protein structure, Pa- 
cific Symposium on Biocomputing, 5, 505-516, (2000) 

18. Park, J.C., Kim, H.S., & Kim, J.J.: Bi-directional incremental parsing for automatic
pathway identification with combinatory categorial grammar, Pacific Symposium on
Biocomputing, 6, 396-407, (2001)

19. Pustejovsky, J., Castano, J., Zhang, J., Kotecki, M., & Cochran, B.: Robust relational
parsing over biomedical literature: Extracting inhibits relations, Pacific Symposium on 
Biocomputing, 7, 362-373, (2002) 

20. Yakushiji, A., Tateisi, Y., Miyao, Y., & Tsujii, J.: Event extraction from biomedical pa- 
pers using a full parser, Pacific Symposium on Biocomputing 6, 408-419 (2001) 

21. Sekimizu, T., Park, H.S., & Tsujii, J.: Identifying the interaction between genes and gene
products based on frequently seen verbs in Medline abstracts, Proceedings of the work- 
shop on Genome Informatics, 62-71, (1998) 






22. Rindflesch, T., Tanabe, L., Weinstein, J., & Hunter, L.: EDGAR: Extraction of drugs, 
genes and relations from the biomedical literature, Pacific Symposium on Biocomputing, 
5, 517-528, (2000)

23. Ng, S-K., & Wong, M.: Towards routine automatic pathway discovery from on-line scien- 
tific text abstracts, Proceedings of the workshop on Genome Informatics, 10, 104-112, 
(1999) 

24. Rzhetsky, A., et al.: GeneWays: a system for extracting, analyzing, visualizing, and
integrating molecular pathway data, Journal of Biomedical Informatics, 37, 43-53, (2004)

25. Pustejovsky, J., et al.: Medstract: Creating large scale information servers for biomedical
libraries, ACL-02, Philadelphia, (2002) 

26. Wong, L.: PIES a protein interaction extraction system. Pacific Symposium on Biocom- 
puting, 6, 520-531, (2001) 

27. Schena, M., Shalon, D., Davis, R.W., & Brown, P.: Quantitative monitoring of gene
expression patterns with a complementary DNA microarray. Science 270, 467-470 (1995) 

28. DeRisi, J., Iyer, V., & Brown, P.: Exploring the metabolic and genetic control of gene
expression on a genomic scale. Science 278, 680-686 (1997) 

29. SPSS LexiQuest mine available at http://www.spss.com 

30. GetItRight available at http://www.cthtech.com

31. LocusLink online gene database available at http://www.ncbi.nlm.nih.gov/locuslink 

32. Genecards online human gene databank available at
http://bioinformatics.weizmann.ac.il/cards/

33. Swissprot sequence database available at http://ca.expasy.org/sprot/

34. GoldenPath, Human Genome project at http://www.cse.ucsc.edu/centers/cbe/Genome/ 

35. HUGO Human Genome Organization at http://www.gene.ucl.ac.uk/hugo/ 

36. Chen, P.: The entity-relationship model: Toward a uniform view of data, ACM Transac- 
tions on Database systems, 1(1), 9-36, (1976) 

37. DIP online protein interaction database available at http://dip.doe-mbi.ucla.edu/ 

38. KEGG: Kyoto Encyclopedia of Genes and Genomes
available at http://www.genome.ad.jp/kegg/

39. Baeza-Yates, R., & Ribeiro-Neto, B.: Modern Information Retrieval, Addison-Wesley,
Harlow, UK, (1999) 

40. Mayanil, C.S., D. George, L. Freilich, E.J. Miljan, B. Mania-Farnell, D.G. McLone, and 
E.G. Bremer, Microarray analysis detects novel Pax3 downstream target genes. J Biol 
Chem, 276(52), 49299-309, (2001)

41. SPSS Clementine workbench available at http://www.spss.com 



Analysis of Protein/Protein Interactions 
Through Biomedical Literature: 

Text Mining of Abstracts vs. Text Mining 
of Full Text Articles 



Eric P.G. Martin 1, Eric G. Bremer 2, Marie-Claude Guerin 1,
Catherine DeSesa 3, and Olivier Jouve 1

1 SPSS, Tour Europlaza, La Defense 4, F-92925 Paris-la-Defense Cedex, France
{emartin, mcguerin, ojouve}@spss.com
2 Children’s Memorial Hospital and Northwestern University, Chicago, IL 60614, USA
egbremer@northwestern.edu
3 SPSS, 233 S. Wacker Drive, Chicago, IL 60606, USA
cdesesa@spss.com



Abstract. The challenge of knowledge management in the pharmaceutical in- 
dustry is twofold. First it has to address the integration of sequence data with 
the vast and growing body of data from functional analysis of genes with the in- 
formation in huge historical archival databases. Second, as the number of bio- 
medical publications exponentially increases (Medline now contains more than 
13 million records), researchers require assistance in order to broaden their vi- 
sion and comprehension of scientific domains. Analogous to data mining in the 
sense that it uncovers relationships in information, text mining uncovers rela- 
tionships in a text collection and leverages the creativity of the knowledge 
worker in the exploration of these relationships and in the discovery of new 
knowledge. We describe herein a text mining method to automatically detect 
protein interactions which are described across a large amount of scientific pub- 
lications. This method relies on natural language processing to identify protein 
names, their synonyms and the various interactions they can bear with other 
proteins. We have then compared text mining analysis on abstracts to the same 
kind of analysis on full text articles to assess how much information is lost 
when only abstracts are processed. Our results show that: 1) LexiQuest Mine is a
very versatile and accurate tool for mining biomedical literature to analyze
interactions between proteins. 2) Mining only abstracts can be sufficient and
time saving for applications that do not require a high level of detail on a large
scale, whereas mining full text articles is preferable for more exhaustive
applications designed to address a specific issue. Availability: LexiQuest Mine is
available for commercial licensing from SPSS, Inc. 



1 Introduction 

There is a difference between information and knowledge. Information in the phar- 
maceutical industry has exploded in recent years. With the completion of the human 
genome project, a great deal of information is available about what human genes are. 
Knowledge of how these genes work, in particular how their protein products interact
to form biochemical pathways and other processes, is much more difficult to assemble 
and understand. In the pharmaceutical industry this knowledge of protein interactions 
and biological pathways is critical to the development of new drugs. 

The number of databases, and of efforts to create controlled vocabularies such as
the Gene Ontology (GO) initiative, speaks to the power of in silico analysis for obtaining
knowledge about protein interactions and biochemical pathways. However, one of the
richest sources of information for these types of interactions is the biomedical litera- 
ture. As the number of publications is exponentially increasing (more than 13 million 
entries currently reside in Medline with an anticipated growth of more than 4% a year 
[1]), researchers require the assistance of computational techniques in order to 
broaden their vision and comprehension of protein interactions. Abstracts and full text 
articles are two important sources of biomedical literature. While abstracts are readily 
available, they are highly filtered and may be limited in content. Full text articles
provide much richer content but can be more difficult to analyze: they require
greater computational capacity and additional pre-processing steps to prepare the
corpus for text mining, and they are still difficult to retrieve from many
information resources.

Analysis of both abstracts and full text requires identification of the entities of in- 
terest and the more complex extraction of relational information and is an area of 
active investigation. Techniques used for the identification of biological entities (e.g. 
protein, gene, small molecule) include rule-based methods [2, 3], dictionary-based 
methods [4-6], and machine-learning methods [7-10]. The more complex tasks in- 
volving the extraction of relational information have also been examined. These in- 
clude protein interactions [11-16], gene annotations [17], relation between genes and 
drugs [18] and identification of metabolic pathways [19]. The techniques vary from 
using pattern-matching rule libraries [11, 12, 17], to shallow parsing techniques [13, 
14, 16] and to full sentence parsing [16]. Pattern-matching systems rely on the match- 
ing of pre-specified templates (patterns) or rules (such as precedence/following rules 
of specific words). Shallow parsers perform partial decomposition of a sentence struc- 
ture to identify certain phrasal components and extract local dependencies between 
them without reconstructing the structure of the entire sentence. On the other hand, 
full sentence parsers deal with the structure of entire sentences. The reported recall 
and precision of these systems range from 60% to 95%. Most of these examples, 
however, represent a specific type of interaction [15, 17] or small set of data [11, 12, 
17]. For example, Sekimizu and colleagues [15, 17] extract relations associated with 
seven frequently occurring verbs (activate, bind, interact, regulate, encode, signal, and
function) found in Medline abstracts. The precision and recall performance of this 
system ranged from 67.8% to 83.3% depending on the particular verb used. Ono et al. 
[11, 12, 17] have reported a precision of 94% and a recall of 83% for text related to 
the Yeast genome only. Pustejovsky et al. [15, 17] reported a precision of 90% and a 
recall of 57% using a corpus of 500 hand-annotated abstracts for “inhibit” relations 
alone. 

We describe here a text mining method to automatically detect protein interactions 
which are found across a large number of scientific publications. This method relies 
on natural language processing to identify protein names, their synonyms and the 
various interactions they can bear with other proteins. We also wanted to assess any 
advantages gained by mining full text articles in addition to mining the more readily
available abstracts alone. In order to accomplish these goals, we decided to analyse 
659 abstracts of scientific articles and to compare them with their corresponding full 
text documents. These abstracts and full length articles allowed us to assess the preci- 
sion and recall of natural language processing with LexiQuest Mine when applied to 
abstracts only and when applied to full text documents to identify semantic relation- 
ships between proteins. 



2 Materials and Methods 

2.1 The DIP Database 

The DIP™ [20] database catalogs experimentally determined interactions between 
proteins. It combines information from a variety of sources to create a single, consistent
set of protein-protein interactions. The data stored within the DIP™ database
are curated both manually, by expert curators, and automatically, using computational
approaches that utilize knowledge about protein-protein interaction
networks extracted from the most reliable, core subset of the DIP™ data.

2.2 Building of the Corpora 

Both abstracts and full text corpora were kindly provided by Dr D. P. Corney, Univer- 
sity College London. Corney et al. [21] first launched a query on DIP to identify arti- 
cles whose record in DIP had two SwissProt identifiers. There were 1434 papers in 
DIP with two SwissProt IDs. BioRAT [22], their own search engine and information
extraction tool, was then used to locate, retrieve and convert to text the corresponding
articles. BioRAT found and downloaded 812 articles, of which 153 failed during
PDF-to-text conversion, leaving 659 full length papers in text format, which constitute
the final corpus.

During this process, the converter introduced many formatting errors, such as:

• suppressions of some blank spaces, leading to changes of tokens 

• addition of carriage returns within words or sentences when pdf files are multi- 
column formatted 

Those mistakes were left uncorrected for subsequent text mining. 

This corpus was deliberately heterogeneous, so protein interactions came from different
experimental systems such as drosophila, mice, yeast and human cell lines.

2.3 LexiQuest Mine 2.2 

LexiQuest Mine [23] is SPSS's text mining software, based on Natural Language
Processing (NLP) technology. LexiQuest Mine works by employing a combination of
dictionary-based linguistic analysis and statistical proximity matching to identify key
concepts, including multi-word concepts. Then, based on a linguistic analysis of the
context and semantic nature of the words, it is able to identify their type (organization,
product, etc.) as well as the degree of relationship between them and other concepts.
LexiQuest Mine has two operating modes that can be used separately or in
sequence:

• Extraction of concepts, based on Part of Speech (PoS) tagging 

• Pattern matching, to identify semantic relationships between known concepts. 



Fig. 1. Scheme of the Text Mining solution; the same core NLP engine is shared by three
different products: LexiQuest Mine, LexiQuest Categorize [24] and Text Mining for Clementine
[25]



To perform pattern matching on protein names and interactions, LexiQuest Mine
relies on:

• Dictionaries of protein names (e.g. il-10, nf-kappa b, ...).

• Dictionaries of synonyms of protein names (e.g. IL-10 = interleukin 10;
NF-kappaB = nuclear factor kappa B).

• Linguistic tags: words that follow protein names (e.g. protein, kinase,
phosphatase, ...) and are used to automatically identify unknown proteins.

• Dictionaries of interactions (binds, inhibits, activates, phosphorylates, ...).

• Dictionaries of synonyms of interactions: only one leading verb is usually displayed
at the end of the processing, so for instance ‘interaction of protein A with
protein B’ and ‘protein A interacts with protein B’ will both count for ‘protein A
interacts with protein B’, where ‘interacts’ is the leading form of ‘interaction’.

• Patterns, like the one described in Fig. 2. A list of 78 patterns was used during
this study.

All dictionaries and patterns are .txt files that can be edited by users. 

sentence: Exogenous IL-10 inhibits NF-kappaB in monocytes.

[pattern(300)]
name = (300)_G_AV_G_1
value = ($VarGene $Verb (the)? $VarGene)
output = interleukin 10 inhibits nf-kappa b

Fig. 2. Example of linguistic patterns used for identification of protein/protein interactions: the
sentence is parsed and matches pattern #300, in which two genes or proteins are separated
by an active verb from the list of predicates. The output leads to the understanding that
interleukin 10 inhibits nf-kappa b
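
The behavior of a pattern such as #300 can be approximated with a regular expression built from the dictionaries. The Python sketch below is an analogue written for illustration only: it is not LexiQuest Mine's pattern syntax or engine, and the dictionaries, the helper alternation and the function match_g_av_g are hypothetical.

```python
import re

# Hypothetical dictionaries: surface form -> leading (preferred) form.
PROTEINS = {
    "il-10": "interleukin 10",
    "interleukin 10": "interleukin 10",
    "nf-kappab": "nf-kappa b",
    "nf-kappa b": "nf-kappa b",
}
VERBS = {"inhibits": "inhibits", "activates": "activates", "binds": "binds"}


def alternation(terms):
    # Longest first, so that multi-word names win over shorter overlapping ones.
    return "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))


# Rough analogue of: value = ($VarGene $Verb (the)? $VarGene)
PATTERN = re.compile(
    rf"(?<![\w-])({alternation(PROTEINS)})\s+({alternation(VERBS)})\s+"
    rf"(?:the\s+)?({alternation(PROTEINS)})(?![\w-])",
    re.IGNORECASE,
)


def match_g_av_g(sentence):
    """Return (protein A, relation, protein B) tuples in their leading forms."""
    results = []
    for a, verb, b in PATTERN.findall(sentence):
        results.append((PROTEINS[a.lower()], VERBS[verb.lower()], PROTEINS[b.lower()]))
    return results
```

Applied to the sentence in Fig. 2, match_g_av_g returns the single tuple ("interleukin 10", "inhibits", "nf-kappa b"), mirroring the output line of the figure.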

LexiQuest Mine identifies any single protein and/or predicate according to the
predefined rules. For precision and recall measurements, only full PMID-protein A-
relation-protein B sequences (also described as ‘3-slot results’) were analysed. When
such a sequence was repeated several times within the same document, only the first
occurrence was kept for evaluation.
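
Keeping only complete 3-slot results and only their first occurrence per document amounts to an order-preserving deduplication. The Python sketch below illustrates this under the assumption that each result is a (PMID, protein A, relation, protein B) tuple; first_occurrences is a hypothetical helper, not part of LexiQuest Mine.

```python
def first_occurrences(matches):
    """Keep complete (pmid, protein_a, relation, protein_b) tuples, first occurrence only."""
    seen = set()
    kept = []
    for match in matches:
        pmid, protein_a, relation, protein_b = match
        if not (pmid and protein_a and relation and protein_b):
            continue  # partial (2-slot) results are not evaluated
        if match not in seen:
            seen.add(match)
            kept.append(match)
    return kept
```

Because the PMID is part of the tuple, the same interaction reported in two different articles is kept once per article.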



3 Results 

3.1 Text Mining Optimisation: Working on Dictionaries 

The first goal of this study was to show that text mining tools can correctly analyse most
of the protein/protein interactions described by authors across a large number of scientific
publications. In order to focus on semantic relationships only, we had to provide
LexiQuest Mine with the correct list of protein names as well as their synonyms. Two
thesauri were used: GeneOntology and the default one provided with LexiQuest
Mine, complemented with the protein gazetteers of BioRAT.

All the main kinds of protein/protein interactions were first listed by experts, and
the linguistic patterns were then tuned to match these kinds of interactions. Preliminary
tests led us to consider the following list of relationships, which is not exhaustive
but in which each predicate evokes at least some direct or indirect protein interaction
(Table 1):



Table 1. List of the main predicates used to analyze protein interactions (interactions consid- 
ered as non oriented are followed by a *) 



Acetylates        Bonds*             Down-regulates       Links*              Recruits        Transforms
Activates         Complex*           Forms complex with*  Mediates            Reduces         Triggers
Antagonizes       De-activates       Hydrolyses           Oligomerizes*       Regulates       Ubiquitinates
Associates with*  Decreases          Inactivates          Over-expresses      Releases*       Up-regulates
Attenuates        Degrades           Increases            Phosphorylates      Represses
Binding to*       Dephosphorylates   Induces              Potentiates         Stimulates
Binds*            Dimerizes          Inhibits             Precipitates with*  Transactivates
Blocks            Dissociates from*  Interacts with*      Reacts with*        Transduces

Each interaction could comply with different patterns of association and negative
rules. A new rule was also defined to find the ‘is a kind of’ patterns.

When existing dictionaries alone were not sufficient to represent all the protein
names or possible interactions found in the 659 documents, we adopted an
approach using concept extraction prior to pattern matching.

Partial pattern matching results were also analysed to detect new protein names or
interactions: LexiQuest Mine delivers ‘3-slot results’ consisting of two named entities
separated by a relationship. Partial results, called ‘2-slot results’, are listed in Table 2.



Table 2. Partial ‘2-slot’ results 



1.   Protein A                     Protein B
2.                Interacts with   Protein B
3.   Protein A    Interacts with




From this table it appears that cases 2 and 3 are useful leads for detecting new
protein names that would otherwise not have been recognized, whereas case 1 is
useful for detecting previously unlisted interactions. As an example, case 1 usually
matches lists of proteins but can also match this kind of sentence:

together with our finding of coimmunoprecipitation of <*act3p*> with histone
<*h2a*> , this suggests the in vivo existence of a protein complex required for correct
expression of particular genes.

This is a great help in adding “coimmunoprecipitation” as another way to describe
an interaction between two proteins.
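
The way partial results feed back into the dictionaries can be summarized as a small decision rule over the three slots. The Python sketch below is an illustration of the reasoning above rather than a feature of LexiQuest Mine; classify_partial and its return labels are hypothetical.

```python
def classify_partial(protein_a, relation, protein_b):
    """Suggest which dictionary a pattern-matching result might help to extend."""
    filled = (bool(protein_a), bool(relation), bool(protein_b))
    if all(filled):
        return "complete 3-slot result"
    if filled == (True, False, True):
        # Case 1: two proteins, no recognized predicate between them.
        return "candidate interaction term"
    if filled in ((False, True, True), (True, True, False)):
        # Cases 2 and 3: a predicate with one unrecognized partner.
        return "candidate protein name"
    return "ignore"
```

Results flagged as candidates would still need expert review, as in the coimmunoprecipitation example, before a term is added to a dictionary.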

The same set of patterns is used for both genes and proteins. The distinction between
genes and proteins is made according to dictionaries of named entities when possible,
and by analysis of the predicates used within these patterns. For instance, all
predicates describing enzymatic activities (e.g. phosphorylates, hydrolyses, ...) are
specific to proteins. Most of the binding activities described in the corpus were
associated with proteins, even if some DNA-binding proteins were also found.

3.2 Automated Detection of Protein/Protein Interactions 

Because the list of the 659 studied articles was retrieved from DIP™, every full text
document was known to contain protein interactions. After text mining, documents
containing at least one 3-slot result were considered as matching (Table 3). The
results show that fewer than 20% of full text articles were missed by LexiQuest Mine.
Although the abstracts represented only 3.3% of the length of the whole full text
corpus, more than 53% of abstracts matched.



Table 3. Proportion of matching documents

                     Total number   Proportion of matching documents
Abstracts only       659            53.41%
Full text articles   659            81.49%

The subsequent analyses were performed only on this matching part of the corpus.
The quantitative analysis of matching records is given in Table 4. The high proportion
of unique matches reflects the high level of heterogeneity of the corpus. As expected,
matching abstracts do not contain numerous protein/protein interactions because they
generally condense the most remarkable results. Full text articles, on the other hand,
yield many more interactions, but the number varies dramatically from one article to
another.



Table 4. Statistics on matching records. Unique matches correspond to a unique sequence of:
PMID - protein A - relation - protein B

                     Total # of   Proportion of    Average # of matches   Standard deviation of
                     matches      unique matches   per document           matches per document
Abstracts only       853          92.26%           2.42                   1.65
Full text articles   8098         60.21%           15.08                  14.11
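
Statistics of the kind reported in Table 4 can be derived from a list of match records along the following lines. This is a minimal Python sketch: the records are invented for illustration, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical match records: (PMID, protein A, relation, protein B).
matches = [
    ("1001", "act3p", "binds", "h2a"),
    ("1001", "act3p", "binds", "h2a"),          # same match found twice in one article
    ("1001", "act3p", "interacts with", "h2b"),
    ("1002", "cdc28p", "phosphorylates", "swi5"),
]

total = len(matches)
unique_share = len(set(matches)) / total        # proportion of unique matches
per_document = Counter(pmid for pmid, *_ in matches)
counts = list(per_document.values())            # number of matches per document

print(f"total matches: {total}")
print(f"unique matches: {unique_share:.2%}")
print(f"average per document: {mean(counts):.2f}")
print(f"standard deviation per document: {stdev(counts):.2f}")
```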



Precision and recall are the two well-known metrics used to evaluate the quality of
information retrieval software. They can be defined as follows:

\[ \mathrm{Precision} = \frac{\mathrm{Good\_matches}}{\mathrm{Good\_matches} + \mathrm{False\_matches}} \]

\[ \mathrm{Recall} = \frac{\mathrm{Good\_matches}}{\mathrm{Good\_matches} + \mathrm{Missed\_matches}} \]
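
The two definitions translate directly into code. The Python sketch below is only an illustration: the function names and the counts are hypothetical and are not taken from the study.

```python
def precision(good: int, false: int) -> float:
    """Good matches divided by all matches returned by the system."""
    returned = good + false
    return good / returned if returned else 0.0


def recall(good: int, missed: int) -> float:
    """Good matches divided by all matches found by the manual review."""
    expected = good + missed
    return good / expected if expected else 0.0


# Hypothetical counts for one reviewed document.
good, false, missed = 37, 3, 19
print(f"precision = {precision(good, false):.3f}")
print(f"recall    = {recall(good, missed):.3f}")
```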

Because it would take too long to manually review every abstract or full text article,
a random sample of 32 documents was chosen for the measurement of precision and
recall. The same statistics as in Table 4 were then calculated to assess the quality of
this sample (Table 5).



Table 5. Statistics about the sample. Unique matches correspond to a unique sequence of:
PMID - protein A - relation - protein B

                     Total # of   Proportion of    Average # of matches   Standard deviation of
                     matches      unique matches   per document           matches per document
Abstracts only       91           81.32%           2.75                   2.45
Full text articles   796          59.80%           24.37                  16.88



The sample was considered statistically representative of the whole corpus, even
though it shows an average number of matches per document slightly higher than
that of the original corpus and a lower proportion of unique matches on abstracts.

Each document was manually parsed and manual results were then compared to 
text mining results to evaluate precision and recall of our system (Table 6). 

Results show a high level of precision on abstracts. It is interesting to note that,
despite the formatting issues and the high heterogeneity found within full text articles,
our linguistic patterns are specific and robust enough to preserve precision.

Table 6. Analysis of precision and recall from the sample

                            Number of   Precision (with confidence   Recall (with confidence
                            documents   interval for p=0.05)         interval for p=0.05)
Abstracts sample            32          92.5% (9.1%)                 66.1% (16.4%)
Full text articles sample   32          82.8% (13.1%)                64.1% (16.6%)



Recall has been found to be high, even if more manual reviewing would be re- 
quired to decrease the confidence interval of those results. The level of recall was 
found to be very similar between abstracts and full text articles. 
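
The text does not say how the confidence intervals in Table 6 were obtained. One common choice, assumed here, is the normal approximation for a proportion, with n taken as the 32 documents of each sample. Under that assumption the half-widths come out as 9.1%, 16.4%, 13.1% and 16.6%, which agrees with the values in Table 6. The Python sketch below is illustrative only, and the function name is hypothetical.

```python
import math

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def half_width(proportion: float, n: int, z: float = Z_95) -> float:
    """Half-width of the normal-approximation interval for a proportion."""
    return z * math.sqrt(proportion * (1.0 - proportion) / n)


# Point estimates from Table 6; n = 32 documents is an assumption.
estimates = [("abstracts precision", 0.925), ("abstracts recall", 0.661),
             ("full text precision", 0.828), ("full text recall", 0.641)]
for label, p in estimates:
    print(f"{label}: {100 * p:.1f}% +/- {100 * half_width(p, 32):.1f}%")
```

Because the half-width shrinks with the square root of n, reviewing four times as many documents would roughly halve these intervals.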

Recall is impacted by either bad document formatting, poor dictionaries of named 
entities, incomplete lists of patterns, or patterns that do not fit well given the way 
authors describe protein interactions. To assess the reasons why about 35% of rela- 
tionships were missed, additional manual work was performed; results are shown in 
Table 7. 

Table 7. Analysis of missed relationships (one missed interaction can result from several
causes and so be counted in several columns)

                            Reason for missing relationship
                            Dictionaries   Patterns   Formatting
Abstracts only sample       65.79%         47.35%     10.52%
Full text articles sample   64.48%         50.82%     13.66%



Even though the formatting of full text articles was considerably worse than that of
abstracts alone, it surprisingly did not cause many more missed interactions,
suggesting that our system is relatively tolerant of poorly formatted documents.

Ultimately, no real difference was observed between abstracts and full text articles:
lack of protein identification comes first (about 65%), followed by non-matching
patterns (about 50%).



4 Discussion 

4.1 NLP Text Mining to Analyse Protein Interactions: Limitations and Improvements

Obtaining better results from full text documents was hampered by the poor quality of
the full text articles after they had been converted from pdf to txt. The best solution
would have been to retrieve full text articles as xml documents, since LexiQuest Mine
is able to selectively process xml tags. Using xml tags would have made it possible to
exclude relationships coming from cited references, which are not specific to the
analysed articles. Hence, pre-processing steps should improve the overall quality of
results.

Even if many interactions were not detected because of the complexity of the way
they were formulated by authors, working at the sentence level allowed our system to
be redundant enough to finally identify most of the interactions. Nevertheless, this
method did not add much noise since 92% of the identified interactions within
abstracts were unique.

Compared to the typical statistical approaches like Bayesian Networks, the NLP 
approaches present many advantages: 

• The processing is much faster and so allows for high throughput text mining appli- 
cations at a corporate level. 

• The level of precision is much higher since interactions are analysed via NLP and 
not only statistically matched. 

• This precision comes with the ability to easily tune the granularity of those
interactions: e.g. for some applications any binding between proteins might be interesting
whereas other applications may concentrate on phosphorylations only. For this
reason, we usually prefer not to aggregate all forms of processing at the NLP level,
but rather, we consider that this fine tuning has to be governed by the final
bioinformatics application.

However, using an NLP engine requires properly setting up the linguistic resources to
be used: dictionaries of entities to be analysed, patterns to describe their relationships
and dictionaries of synonyms. This becomes harder as the number of entities to be
analysed grows. Even if some protein nomenclatures have been developed, they are
not commonly employed by the scientific community, which uses very different
experimental models and is composed of many different scientific backgrounds [26].

Automatic identification of protein names has been investigated for many years [2,
26, 27]. In the present study, our system identified proteins according to dictionaries.
Those dictionaries contain protein full names, synonyms and useful protein linguistic
tags (like protein, kinase, phosphatase, ...). Analysis of missed interactions shows that
this step is the main bottleneck of the whole process. We have very recently added the 
ability to define named entities via the use of regular expressions. We believe the 
combination of these two approaches will improve recall without dramatically ham- 
pering precision. 
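
The combination of dictionary lookup and regular expressions mentioned above can be sketched as follows in Python. Everything in this example is an assumption made for illustration: the tiny dictionary, the single regular expression (which targets yeast-style names such as “act3p”) and the function name do not come from LexiQuest Mine.

```python
import re

# Hypothetical resources: a tiny dictionary and one illustrative pattern.
PROTEIN_DICTIONARY = {"act3p", "h2a", "histone h2a"}
# Assumed naming convention: three letters, digits, then "p" (e.g. "cdc28p").
PROTEIN_PATTERN = re.compile(r"\b[a-z]{3}\d+p\b", re.IGNORECASE)


def find_protein_names(sentence: str) -> set[str]:
    """Return names found by dictionary lookup or by the regular expression."""
    lowered = sentence.lower()
    found = {name for name in PROTEIN_DICTIONARY
             if re.search(rf"\b{re.escape(name)}\b", lowered)}
    # The pattern catches names that are absent from the dictionary.
    found.update(m.group(0).lower() for m in PROTEIN_PATTERN.finditer(sentence))
    return found


print(sorted(find_protein_names(
    "Act3p coimmunoprecipitates with Cdc28p and histone H2A.")))
```

Here “cdc28p” is found only through the pattern, which shows how such rules could raise recall; an overly broad pattern would, conversely, reduce precision.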

Many other systems have already been used to analyse protein interactions [11, 12, 
21, 28-33]. Corney et al. worked on the same data with BioRAT [21] and compared 
their results with SUISEKI (System for Information Extraction on Interactions)[28]. 
Both systems had a recall of 20-22% on abstracts only whereas BioRAT presented a 
43% recall and a 50% precision on full text articles. Although this latter work used a
larger sample of documents for manual reviewing (more than 200 vs. 32), the most
pessimistic values within our confidence intervals (Table 6) would still give about 80%
precision and 50% recall on abstracts, and 70% precision and 50% recall on full text articles.
Those differences may be explained by: 

• A different NLP core engine, more sophisticated than the GATE[34] toolbox used 
by BioRAT 

• More linguistic patterns (78 in LexiQuest Mine vs. 19 in BioRAT) to describe 
protein interactions, which take into account coordination and negation for instance 

• A semi-automatic optimisation of the list of named entities thanks to analysis of
‘2-slot’ outputs.

4.2 Abstracts Analysis vs. Full Text Analysis: A False Opposition

Mining full text articles obviously provided many more interactions than mining
abstracts alone but, unlike Corney et al. [21], we did not obtain any significant
differences in terms of precision and recall. Moreover, the analysis of missed
interactions revealed the same ratios of dictionary, pattern and formatting failures.

Compared to full text articles, abstracts contain less information but present many 
advantages: 

• They can easily be found as txt files, so the errors that occur during conversion of
pdf documents are avoided.

• Information is refined: only remarkable results are usually mentioned.

• Relationships are usually expressed in a simpler way so they are more easily
recognized by an NLP text mining tool.

• They are accessible for free. 

• Their processing is much faster (in this study their size represented about 3.3% of 
the full text corpus). 

So we believe that abstracts should be chosen when building large-scale text mining
applications that are not required to be exhaustive, but that need to quickly analyse
numerous sources of information with a very good level of precision.

On the other hand, analysis of full text articles has to be chosen when: 

• Having access to full text articles is possible 

• The number of studied entities or relationships is low and well delimited so that a 
high level of precision is obtained without spending a long time on dictionary tun- 
ing. 

• Experimental information has great value, e.g. controversial coimmunoprecipitation
results can easily be compared by looking at experimental conditions (antibodies,
tissues, methods ...) and determining the reasons for differences.

• The goal is to mine for elusive results, or for names or interactions that are
not yet well known in the scientific community.

At a corporate level, we envisage that both kinds of applications will be deployed 
over a large scale in the near future, making NLP text mining a common R&D tool 
used on a daily basis for some basic applications whereas more sophisticated tuning 
will be found in applications designed to address very specific issues. 

4.3 The Tremendous Impact of a Mixed Data/Text Mining Approach 

Much of the large-scale data mining and analysis of genomic and proteomic data to 
date has focused on expression patterns and in particular on establishing clusters 
based on expression data [35-38]. These methods can provide insight into expression 
correlations but do not provide much information about functional relationships. 
Functional relationships can be complex and not related to expression pattern as seen 
in the following examples. First, functionally related genes may play antagonistic
roles within a pathway and thus show an anti-correlation in their expression pattern.
Second, genes or proteins with similar expression profiles may be involved in distinct
biological processes. Third, genes or proteins may play multiple roles in complex
interrelated processes that involve entities from several clusters. The functional
relationships among genes/proteins cannot be determined by cluster analysis alone, and
explaining the formed clusters requires extensive further analysis.

The information needed for this type of functional analysis can often be found in 
the published literature. The example of text mining for protein-protein interactions 
described in this communication presents an approach to creating an information base 
of protein associations that can then be mined along with expression data, drug
interaction data, or clinical data. A data mining process applied to this entire
information base can then provide knowledge about the functional relationships.



5 Conclusion 

With results above 80% precision and 60% recall, our system showed that there was a 
very low rate of false positive interactions with this text mining approach on both 
abstracts and full text. Most of the well known protein associations could be identified 
from abstracts alone. Mining full text articles, on the other hand, greatly enhances the 
number of identified interactions with a limited impact on precision. 

Although we focused on protein/protein interactions, LexiQuest Mine can be
used to analyse relationships between any type of named entities. The next version of
LexiQuest Mine will soon enable users to analyse interactions among different named
entities at the same time, such as relationships between chemical compounds, genes or
proteins, pathologies and tissues. This will provide databases with great added value,
taking into account results coming from both biomedical literature and in-house
assays. In a post-genomics era, these databases will probably be the next goldmines for
data mining to discover new risks and associations and to provide a better
understanding of both molecular and disease mechanisms.

As LexiQuest Mine is already successfully used by many bioinformatics departments
of both large academic and public laboratories, we believe this is the right
approach to building ‘high throughput text mining’ platforms that will efficiently
support any kind of new application in bioinformatics in the future.



Abstract. In this paper we investigate several hypotheses concerning
document relevance ranking for biological literature. More specifically, we 
focus on three topics: performance, risk of local searching, and homonymy 
recognition. Surprisingly, we find that a quite simple ranker based on 
the occurrence of a single word performs best. Adding this word as a 
new search term to each query yields results comparable to elaborate 
state-of-the-art approaches. The risk of our local searching approach is 
found to be negligible. In some cases retrieval from a large repository 
even yields worse results than local search on a smaller repository which 
only contains documents returned by the current query. The removal of 
automatically determined homonyms yields almost indistinguishable re- 
sults to the original query, so it is not inconceivable that the problem of 
homonymy in biological literature has been overstated. In conclusion, our
investigation of three hypotheses has helped us to decide implementation
issues within our research projects and has opened interesting
avenues for further research.



1 Introduction 

Genome research has spawned unprecedented volumes of data, but characterization
of DNA and protein sequences has not kept pace with the rate of data
acquisition. To anyone trying to know more about a given sequence, the worldwide
collection of abstracts and papers remains the ultimate information source.
The goal of the BioMinT 1 project is to develop a generic text mining tool that 
(1) interprets diverse types of query, (2) retrieves relevant documents from the 
biological literature, (3) extracts the required information, and (4) outputs the 
result as a database slot filler or as a structured report. The BioMinT tool will 
thus operate in two modes. As a curator’s assistant, it will be validated on SwissProt 2
and PRINTS 3 ; as a researcher’s assistant, its reports will be submitted to
the scrutiny of biologists in academia and industry. The project is conducted by
an interdisciplinary team from biology, computational linguistics, and data/text
mining.

1 Biological Textmining, EU FP5 QoL project no. QLRI-CT-2002-02770.

2 SwissProt is a popular database on protein sequence, function and other features.
See [2].

3 The PRINTS resource is a compendium of protein fingerprints. A fingerprint is a
group of conserved motifs which characterise protein families and may be used to
infer the function of an unknown protein. See [1].

Within this paper, we focus exclusively on the first mode of the BioMinT tool 
as the curator’s assistant. More specifically, our evaluation focusses on document 
relevance ranking for medical annotation of Homo sapiens within SwissProt. 
Document relevance ranking is an important task within our project, so we have 
implemented a set of rankers for this task. 

Ranking of documents by relevance is a well-researched topic. For example,
the Text Retrieval Conferences are organized by the U.S. National Institute of
Standards & Technology 4 on a yearly basis in the form of
competitions, yielding dozens of publications per year. Other papers include [3,
7, 10]. A comprehensive overview of the growing field of biological literature text
mining can be found in [5].

In this paper we investigate the performance of diverse ranking systems 
within the BioMinT prototype. We consider both query-independent ranking 
systems, i.e. those which are trained on a corpus of relevant and irrelevant doc- 
uments from the same topic and where the score 5 of any one document only 
depends on its content; and query-dependent ranking systems, i.e. those where 
the score is dependent on document contents and query while no training cor- 
pus is used. The latter correspond roughly to general search engines such as 
Google, Yahoo and the infamous Windows search function while the former cor- 
responds to the machine learning approach to ranking: learn a model to predict 
relevance from training data and apply it to generate a ranking 6 . An excellent 
result of a quite simple ranker prompted us to combine these two approaches in 
a straightforward manner, improving the result further. We intend to address 
the combination of these ranking methodologies more comprehensively in the 
future. 

Furthermore, we will also estimate the risk of our local searching approach. 
That is, we aim to answer whether it would be preferable to have the full MED- 
LINE 7 database available directly instead of sending queries to PubMed 8 , and 
ranking only the retrieved document references - which is our current local 
searching approach 9 . 

4 For more details and publications, see trec.nist.gov

5 The ranking is simply the sorted list of documents according to score, i.e. the first 
document has the highest score, the second has the second-highest score and so on. 

6 This implicitly assumes that a model which predicts relevance sufficiently well is
also a good ranker. This is not always the case: for example, in our experiments we
found that models which perform worse at prediction may perform better at ranking
and vice versa.

7 MEDLINE is a large comprehensive repository of around twelve million bibliographic 
references, managed by the U.S. National Center for Biotechnology Information 
(NCBI). A few thousand references are added daily from a variety of sources. 

8 www.pubmed.org, managed by the U.S. National Center for Biotechnology Informa- 
tion (NCBI), is a retrieval engine for the MEDLINE database. 

9 It should be mentioned that local searching as defined here may be a misnomer.

Lastly, we report an investigation on the usefulness of homonymy recognition
and find the recognition to work well but yield no improvement in terms of
ranking performance.

Our experiments are based on the medical annotation dataset by the Swiss
Institute of Bioinformatics which has been the subject of previous papers, e.g. [3].

2 Motivation 

A web-based prototype system for the BioMinT project is in development at
our institute. It currently offers the functionality to expand query terms via
synonyms extracted from fourteen online databases (see Section on Synonym
Expansion); to retrieve bibliographic references from MEDLINE via the online
PubMed search engine; and to process these collections to create ranked document
listings via diverse ranking algorithms, four of which have been chosen for our
experiments here.

We were at first interested in investigating the relative performance of our
rankers, and especially in the strengths and weaknesses of query-dependent classical
ranking algorithms (i.e. those which determine relevance by computing similarity
with the query) versus query-independent learning rankers (i.e. those which
determine relevance by generalizing from a given collection of relevant and
non-relevant documents, not taking the specific query to be answered into account).

Due to licensing problems it is at present moderately difficult to maintain a
local snapshot of the full MEDLINE database even for research purposes, and almost
impossible in a commercial setting, at least for non-U.S. companies. So our
current approach is to extend the query with synonym expansion (=oversearching),
send the query to PubMed and afterwards locally postprocess the retrieved
document set with filtering and ranking approaches. We call our approach local
search. The second main point of this work is thus to determine whether this
approach is competitive with having the full MEDLINE database locally indexed,
and if not, how high the risk is to use it in the future, in terms of significant
differences in ranking performance.

Lastly, word sense disambiguation in biology remains challenging (see e.g. 
[11]). Related to this is the problem of homonymy - a single protein/gene name 
may refer to multiple protein/gene entities. While we have insufficient data for 
proper word-sense disambiguation, it is still possible to investigate whether re- 
moval of homonyms from the query improves the ranking. This is what we inves- 
tigated in the third and final experiment. Our current synonym database suggests 
a simple way to recognize homonyms which has been preliminarily validated by 
domain experts and looks very promising. 

These are the three main topics for this work. We will now proceed to explain
the synonym expansion process and describe the medical annotation dataset,
followed by experimental setup and finally our experimental results concerning
these three topics.

3 Synonym Expansion

The synonym expansion is based on a composite database of protein and gene 
names and synonyms created from fourteen online databases. Here, we focus on 
species Homo sapiens, so only five databases are relevant: SwissProt[2], Genew[9], 
OMIM[8], LocusLink 10 and GDB 11 . 

As a general procedure, we extracted all names from appropriate fields of a
given entry and created all pairwise combinations of these synonyms, combined
with species and source information, as separate entries in the synonym tables.
This procedure mainly relies on synonym information being symmetric (i.e. if A 
is a synonym of B, then B is a synonym of A) and on the synonyms from a given 
entry being concerned with the same gene or protein. 
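
The pairwise expansion of one database entry can be sketched as follows in Python. The function name and the tuple layout are hypothetical; the three names are taken from the TULP1 query shown later in the text, and the source label is illustrative only.

```python
from itertools import permutations


def synonym_pairs(names, species, source):
    """Return all ordered pairs of distinct names from one database entry."""
    unique_names = sorted(set(names))
    # Ordered pairs make the symmetry explicit: (A, B) and (B, A) are both stored.
    return [(a, b, species, source) for a, b in permutations(unique_names, 2)]


entry_names = ["TULP1", "RP14", "TUBL1"]
pairs = synonym_pairs(entry_names, "Homo sapiens", "SwissProt")
for pair in pairs:
    print(pair)
```

An entry with n distinct names yields n(n - 1) ordered pairs, which is consistent with the number of synonym pairs growing much faster than the number of names.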

Several databases contain references, or links, to corresponding entries in 
other databases. Only links where both endpoints exist, and which refer to the 
same species, are processed. This additional information is also integrated into 
the database as follows. 

We assume that database links are symmetric and transitive. Both should be 
instantly obvious from the fact that entries are linked only when they refer to 
the very same gene or protein. We have accounted for this fact by considering 
all links symmetric and extending the link structure by transitive closure. Thus, 
two entries are linked if and only if there is a path of length greater than zero 
between them in the link graph. 

An alternative view of this process is that we partition the link graph into 
distinct subgraphs, each of which is not connected to any other subgraph. The 
entries within each subgraph are connected by link paths of arbitrary length. We
have called each subgraph a synonym group, and assigned a unique number to
it. As a special case, single database entries without associated links are also
considered a synonym group and assigned a unique number. We then consider
all entries in each group to be part of a super-entry, consisting of names from 
all those entries; and extend the database with these new synonym pairs. 
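
Partitioning the link graph into synonym groups amounts to finding connected components. The Python sketch below uses a union-find structure for this. It is one possible way to do it, not necessarily the one used for the BioMinT database, and the entry identifiers are invented for illustration.

```python
def synonym_groups(entries, links):
    """Assign a group number to every entry; linked entries share a number."""
    parent = {entry: entry for entry in entries}

    def find(entry):
        # Follow parent pointers to the representative, shortening the path.
        while parent[entry] != entry:
            parent[entry] = parent[parent[entry]]
            entry = parent[entry]
        return entry

    for a, b in links:
        if a in parent and b in parent:     # only links where both endpoints exist
            parent[find(a)] = find(b)       # links are treated as symmetric

    group_ids = {}
    result = {}
    for entry in entries:
        root = find(entry)
        result[entry] = group_ids.setdefault(root, len(group_ids) + 1)
    return result


# Hypothetical entry identifiers; "X:9" is a link target that does not exist.
entries = ["SwissProt:1", "OMIM:2", "GDB:3", "Genew:4"]
links = [("SwissProt:1", "OMIM:2"), ("OMIM:2", "GDB:3"), ("Genew:4", "X:9")]
print(synonym_groups(entries, links))
```

The first three entries end up in one group through the transitive link, while the unlinked entry forms a group of its own, as in the special case described above.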

During the creation of the synonym database we noticed quite a few unusual
protein names, which were brought to the attention of domain experts.
The experts have since provided cleaning rules, which are applied after each entry
has been processed.

The current release of the synonym database was updated on 7th of June
and contains 501,866 unique names; 11,277,791 unique synonym pairs (including
source database, id and source field data) and 329,257 unique synonym groups
from a total of 7,395 unique species.



10 http://www.ncbi.nlm.nih.gov/LocusLink/, a composite database managed by the 
U.S. National Center for Biotechnology Information. 

11 www.gdb.org, GDB Human Genome Database (GDB) was developed and maintained
by The Hospital For Sick Children, Toronto, Ontario, Canada (1998-2002), and
Johns Hopkins University, Baltimore, Maryland, United States of America (1990-2002).
In January 2003, GDB-related software and public data were transferred to
RTI International. RTI continues to host GDB as an open, public resource.

4 Medical Annotation Dataset

The medical annotation dataset is concerned with the relevance of documents
encountered during annotation of thirty-two genes for SwissProt. It contains
32 queries and 2,188 documents classified as Good (relevant), Bad (irrelevant)
or Unclear (insufficient data to determine relevance). We removed documents 
classified as Unclear since the task of determining whether insufficient data for 
relevance determination exists is orders of magnitude harder and not as interest- 
ing to study as the task of learning models for known relevance. This approach 
is equivalent to assuming a missing class value for Unclear. Thus we follow the 
TREC 12 methodology in that we assume relevance to be a binary value, which 
greatly facilitates evaluation and comparison of different approaches. 1,834 doc- 
uments remain after removal of class Unclear, of which 20% are assigned to class 
Good (relevant). Specific information on the queries is shown in Table 1. 

We have arbitrarily chosen nine queries from the medical annotation dataset 
for testing, and the others for training. Of the remaining 23 queries, we removed 
two because they did not contain any relevant documents; and one because 
its query cannot be represented in our current prototype since it contains a 
negation 13 . 

The name of each query is equivalent to a gene name, which in turn refers 
to the main search term used for a PubMed query by the annotators. We ex- 
panded the main search term via synonym expansion, adding all unique names 
from within all synonym groups that contain the search term. We restricted the 
search to Homo sapiens. We also added the six search terms (mutation mutations 
variant variants polymorphism polymorphisms) to recreate the original queries 
as closely as possible. 

Since about a year has gone by since the creation of the original medical
annotation dataset, it is not surprising that the queries now return more documents
(#QDocs > #Docs for all queries in Table 1). Since the relevance of new documents
is not known to us, we have chosen to evaluate mainly those documents whose
relevance is known from the medical annotation database - except for Avg. rel.Rank,
which is computed over the current query (see Evaluation Measures).

12 TREC is a series of yearly Text REtrieval Conferences organized by the U.S. National
Institute of Standards and Technology, see trec.nist.gov. The TREC conferences
have been centered around specific text mining problems from the beginning
in 1992, always in a competitive setting.

13 It is quite feasible to extend this, but this feature has not yet been implemented.

Query                            #QDocs   #Docs   #RDocs

Test, for all rankers
wt1                              660      128     17
ump synthase                     146      13      1
xpa                              993      131     1
vhl                              590      299     58
wrn                              247      137     9
xpc                              199      50      2
wfs1                             93       17      11
GCDH                             77       11      9
tulp1                            24       17      1

Train, for query-independent rankers only
ADRBT                            116      1       0
CDH1                             744      32      4
ESR1                             3709     80      3
GLB1                             3401     4       1
LPL                              789      234     65
MRP1                             1560     49      3
abcb1                            949      100     9
mrp2                             839      18      4
mrp6                             669      14      10
sur1                             2241     78      14
tgfbr2                           11349    11      1
tgm1                             6640     22      11
tpo not thrombopoietin*          403      66      9
triosephosphate isomerase        312      101     16
tsc1                             867      112     6
umps*                            146      6       0
urod                             270      10      6
uroporphyrinogen-III synthase    41       28      17
vdr                              1015     36      9
vmd2                             4519     9       6
whn                              33       27      1
zap70                            8680     7       1
zic3                             34       7       1

Table 1. Medical annotation dataset partitioned into training and test
queries. #QDocs, returned documents for expanded query as of July
2004; #Docs, documents in original query as of 2003; #RDocs, relevant
documents in original query. * denotes removed queries, see text.

5 Experimental Setup

5.1 Query Construction

As we mentioned earlier, each query was constructed from the expanded main
search term, which corresponds to the query name. Synonym expansion is done
by looking up all synonyms in our synonym database for species Homo sapiens.
Additionally, the following search terms were added: (mutation mutations
variant variants polymorphism polymorphisms). Query terms were enclosed in
double quotes ("). Other search terms were similarly treated and concatenated
to the original query via AND. For example, the final query for tulp1 was:

("RP14" OR "tubby like protein 1" OR "Tubby related protein 1"
OR "Tubby-like protein 1" OR "Tubby-like protein-1" OR "TUBL1"
OR "TULP1") AND ("mutation" OR "mutations" OR "variant" OR
"variants" OR "polymorphisms" OR "polymorphism")

The final query was then sent to the online PubMed search engine and all docu- 
ments were retrieved and processed by the rankers which we shall now describe 
in turn. 
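
The query construction described above can be sketched in a few lines of Python. The function names are hypothetical, and the sketch only builds the query string; it does not contact PubMed. The restriction to Homo sapiens is left out because the text does not show how it was expressed in the query.

```python
VARIATION_TERMS = ["mutation", "mutations", "variant", "variants",
                   "polymorphisms", "polymorphism"]


def or_group(terms):
    """Quote each term and join the terms with OR inside parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(synonyms, extra_terms=VARIATION_TERMS):
    """Concatenate the synonym group and the extra search terms via AND."""
    return or_group(synonyms) + " AND " + or_group(extra_terms)


tulp1_synonyms = ["RP14", "tubby like protein 1", "Tubby related protein 1",
                  "Tubby-like protein 1", "Tubby-like protein-1", "TUBL1", "TULP1"]
print(build_query(tulp1_synonyms))
```

With the synonyms listed in the example, this reproduces the query string given in the text.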

5.2 Rankers 

We chose four rankers for our evaluation - two classic ranking systems which
assign scores to documents based on their similarity to the query (LR and SR),
and two learning systems which try to learn scores for documents based solely
on their content without reference to the query (NBR, ORR). The former are
called query-dependent and the latter query-independent rankers because of this
important distinction. The query-independent rankers utilize documents from all
training queries for learning while this source of information is not used by the
query-dependent rankers. For the query terms, it is vice versa. RND is a special
case which estimates query complexity as the performance of a random ranker.

— LuceneRanker (LR) utilizes the Java-based text indexing and retrieval engine
Jakarta Lucene 14 for ranking. The high performance of this engine allows us
to index all given query documents on-the-fly and search the index for the
final query. Lucene has also been used as a competitive baseline for the TREC
2003 Genomics Track competition.

14 http://jakarta.apache.org/lucene

Lucene uses the following formula to compute the score for each document.
No term boosting was used: $\forall t: boost_t = 1$.

$$score_d = coord_{q,d} \cdot \sum_t \frac{tf_q \cdot idf_t}{norm_q} \cdot \frac{tf_d \cdot idf_t}{norm_{d,t}} \cdot boost_t \qquad (1)$$

where

$score_d$ = score for document $d$ (2)
$coord_{q,d}$ = number of terms in both query and document divided by number of terms in query (3)
$tf_q$ = the square root of the frequency of $t$ in the query (4)
$idf_t = \log\frac{numDocs}{docFreq_t + 1} + 1.0$ (5)
$numDocs$ = number of documents in index (6)
$docFreq_t$ = number of documents containing $t$ (7)
$norm_q = \sqrt{\sum_t (tf_q \cdot idf_t)^2}$ (8)
$tf_d$ = the square root of the frequency of $t$ in $d$ (9)
$norm_{d,t}$ = the square root of the number of tokens in $d$ in the same field as $t$ (10)
$boost_t$ = the user-specified boost for term $t$ (11)



A variant of LR, LuceneIndexRanker (LIR), will be used for evaluating local
search. The only difference is that LIR uses a one-year snapshot of MEDLINE
as background database and adds the documents from the local query to
this index before searching this larger index. This simulates what the search
would be like if we were to create an index on the full MEDLINE database.

SimpleRanker (SR) ranks by a simplified score. For each document, it com- 
putes the proportion of query terms which actually appear in the document. 
This ranker is intended to serve as a simplified baseline to the more refined 
score computation by LR, but has the advantage that documents can be 
ranked instantly without having to wait until the full document collection 
becomes available. The latter is necessary for LR as all documents are needed 
to build the full-text index. 

NaiveBayesRanker (NBR) utilizes a pre-trained Naive Bayes classifier for
ranking. Naive Bayes is a common machine learning algorithm based on
Bayes’ Rule, see [4]. Probabilistic classifiers like Naive Bayes have been shown
to tackle the problem of document relevance ranking for biomedical literature
successfully, see [7, 10]. Title and abstract of each document are transformed
into a bag-of-words representation prior to processing, where each word is
represented by one attribute encoding binary occurrence. The top 1000 most
frequent words appearing in documents from the training queries plus the
document’s classification as Good or Bad were used for training this
classifier. No stemming, lexical preprocessing, or normalization took place. NBR
outputs a numeric score, i.e. the probability of a document being relevant,
estimated from the training data.

— OneRRanker (ORR) also utilizes a pre-trained model like NBR. However, 
the model by ORR is much simpler and consists of one rule based on a single 
word - the word which yields most information on the class, see [6]. In our 
case, the rule obtained from training data was: 

    IF ’missense’ appears in document THEN score=1.0 (relevant)
    ELSE score=0.0 (irrelevant)

Contrary to NBR, this ranker can only output binary scores (either 0 or 1).
All documents considered relevant, and likewise all documents considered
irrelevant, will therefore be output in input order, i.e. reverse
chronological order, which is usually not well correlated with any reasonable
relevance order 15. This is usually not a good property for a ranker since it
cannot hedge its bets: all relevant documents are considered equal, as are all
irrelevant documents. Ordering within the set of relevant/irrelevant documents
is given by the document source, and not by the ranker. Exactly the same
training data as for NBR was used here.

— RandomRanker (RND) is an even simpler baseline than SR. It shuffles the
input documents randomly, corresponding to assigning a random numeric
score to each document, independent of its contents! RND is obviously
not suited for meaningful ranking, but measures the complexity of each test
query: queries with mostly relevant documents will perform quite well with
this approach, while queries with few or only one relevant document will
fare poorly. It should be easy to beat RND, which is indeed the case. The
measures for RND have been averaged over 1000 runs for each query to yield
more stable estimates.
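
The three simplest scores from the list above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' Java code and not the Lucene API: all function names are hypothetical, documents are assumed to be pre-tokenized lists of single-word tokens in one field, and `doc_freq` is assumed to map each term to the number of indexed documents containing it. `lucene_style_score` follows formula (1) with all boosts equal to 1.

```python
import math
from collections import Counter


def lucene_style_score(query_terms, doc_tokens, doc_freq, num_docs):
    """Score one document following formula (1) with boost_t = 1."""
    q_counts = Counter(query_terms)
    d_counts = Counter(doc_tokens)
    idf = {t: math.log(num_docs / (doc_freq.get(t, 0) + 1)) + 1.0
           for t in q_counts}
    tf_q = {t: math.sqrt(c) for t, c in q_counts.items()}
    norm_q = math.sqrt(sum((tf_q[t] * idf[t]) ** 2 for t in q_counts))
    norm_d = math.sqrt(len(doc_tokens))  # single field assumed
    matched = [t for t in q_counts if t in d_counts]
    if not matched or norm_q == 0 or norm_d == 0:
        return 0.0
    coord = len(matched) / len(q_counts)
    total = sum((tf_q[t] * idf[t] / norm_q)
                * (math.sqrt(d_counts[t]) * idf[t] / norm_d)
                for t in matched)
    return coord * total


def simple_score(query_terms, doc_tokens):
    """SR: proportion of distinct query terms that appear in the document."""
    terms = set(query_terms)
    return len(terms & set(doc_tokens)) / len(terms) if terms else 0.0


def oner_score(doc_tokens, keyword="missense"):
    """ORR: binary score from the single learned word."""
    return 1.0 if keyword in doc_tokens else 0.0
```

Note that `simple_score` and `oner_score` need only the document itself, whereas `lucene_style_score` needs document frequencies from the whole collection, which mirrors the remark that LR has to wait for the full-text index.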

All rankers and supplementary tools are written in Java which facilitated inter- 
operability. We also removed nine documents which were present in both training 
and test queries since this may lead us to overestimate the performance of the 
system 16 . 

15 This is due to the search via PubMed which returns documents in reverse chrono- 
logical order (i.e. the newest document is on top) 

16 In unpublished experiments, changes from 0.74 to 0.64 in average precision were 
observed when removing overlapping documents. There are two viewpoints on this: 
One, we might assume that some overlap between queries is realistic and do not 
bother. Two, we might ensure that training and test set are not overlapping to 
prevent such overestimation of performance. We have chosen the second approach 
since an overlap of nine documents for nine test queries is quite significant and - 
given that three of the queries have only one relevant document - could potentially 
bias results dramatically. So we chose to err on the side of caution.

5.3 Evaluation

We have chosen three measures for evaluating our rankers. Let us assume that
the ranked documents are numbered $d_1, d_2, d_3, \ldots, d_n$ in ranked order,
where $d_1$ is predicted to be most relevant. Let the index of the $i$th
relevant document be $r_i$ and the total number of relevant documents be $R$.
Let $R_i$ be the number of relevant documents which are returned before or at
$i$, i.e. the size of the set $\{r_j \mid r_j \le i\}$. Then
$Prec(i) = R_i / i$ and $Recall(i) = R_i / R$ are precision and recall after
returning the $i$th document.

— Average precision (Avg.Prec), i.e. the mean of precision at each relevant
document retrieved (defined as in TREC: $\frac{1}{R} \sum_{i=1}^{R} Prec(r_i)$).

— Precision-Recall break even point (PRBE), i.e. the point within the ranking
where precision equals recall ($Prec(i) = Recall(i)$ for any $i$). If it is not
defined, we chose to take the $i$ in the ranking where precision and recall
differ least ($= i_{minDiff}$), and computed the average of recall and precision
there: $(Prec(i_{minDiff}) + Recall(i_{minDiff})) / 2$. Contrary to average
precision, which is hard to interpret, this is a real recall/precision value
which can actually be achieved by a cutoff at $i_{minDiff}$.

— Average relative rank (Avg.rel.Rank) in the current query from July 2004. 
Here, the rank of all relevant documents within the larger and more current 
query is averaged and normalized to the number of returned documents 
(#QDocs in Table 1). This value tells us how the relevant documents are 
distributed in the present query and roughly at which place we would expect 
a known relevant document to appear on average. Because we do not know 
how many of the documents ranked before known relevant documents are 
also relevant, this may be of limited use. Here, smaller values indicate better 
performance. 

For each comparison, all three values are reported. The arithmetic average of
Average precision, PRBE and Avg.rel.Rank over the nine test queries is also
reported throughout. We chose not to use standard deviation over Average
precision because it is not usually used in TREC evaluation. Also, the queries
are of quite different complexity (see RND), so a proper normalization
procedure would have to be devised. It is not obvious how to achieve this in a
fair manner.
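
The three measures can be sketched as below. This is an illustration under assumptions rather than the evaluation code used for the paper: the function names are hypothetical, a ranking is given as a list of booleans (True for a relevant document) in ranked order, and for PRBE the sketch only considers ranks at or after the first relevant document, which is an assumption the text does not state explicitly.

```python
def average_precision(relevance):
    """Mean of Prec(r_i) over all relevant documents in the ranking."""
    total = sum(relevance)
    if total == 0:
        raise ValueError("ranking contains no relevant document")
    hits, acc = 0, 0.0
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            acc += hits / i  # precision at the rank of this relevant document
    return acc / total


def prbe(relevance):
    """Mean of precision and recall at the rank where they differ least."""
    total = sum(relevance)
    if total == 0:
        raise ValueError("ranking contains no relevant document")
    best = None  # (absolute difference, mean of precision and recall)
    hits = 0
    for i, rel in enumerate(relevance, start=1):
        hits += rel
        if hits == 0:
            continue  # assumption: skip ranks before the first relevant document
        prec, rec = hits / i, hits / total
        candidate = (abs(prec - rec), (prec + rec) / 2)
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best[1]


def average_relative_rank(relevant_ranks, num_returned):
    """Mean rank of the relevant documents, normalized by #QDocs."""
    return sum(relevant_ranks) / len(relevant_ranks) / num_returned
```

As a worked case, a query with a single relevant document placed at rank 11 gives an average precision of 1/11 (about 0.091) and a PRBE of (1/11 + 1)/2 (about 0.545). These agree with the tulp1 entries for LR in Tables 2 and 3, although the source does not state the underlying rank.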



6 Results 

6.1 Ranking Comparison 

The results of the ranking can be found in Tables 2, 3 and 4. Surprisingly, the
simplest query-independent ranker ORR is best both on mean Avg.Prec and on
mean PRBE; NBR wins on mean Avg.rel.Rank which is a less reliable indicator
of performance. Generally, query-dependent approaches (LR and SR) perform
satisfactorily given that they did not use any information except the query terms
themselves but there is a slight performance gap.

Table 2. Average precision. Avg., mean average precision over all queries.

          LR      SR      NBR     ORR     RND     LR_M
wt1       0.199   0.164   0.267   0.366   0.165   0.364
ump s.    1.000   1.000   0.333   1.000   0.245   1.000
xpa       0.500   0.333   0.500   0.019   0.043   0.250
vhl       0.449   0.407   0.617   0.677   0.209   0.604
wrn       0.698   0.438   0.462   0.282   0.096   0.699
xpc       0.292   0.171   0.500   0.559   0.106   0.700
wfs1      0.874   0.884   0.930   0.873   0.700   0.907
GCDH      0.977   1.000   0.878   0.792   0.863   0.977
tulp1     0.091   0.111   0.333   1.000   0.211   0.100
Avg.      0.564   0.501   0.536   0.619   0.293   0.622



Table 3. Precision-Recall break even point. Avg., mean of PRBE.

          LR      SR      NBR     ORR     RND     LR_M
wt1       0.118   0.118   0.294   0.353   0.135   0.353
ump s.    1.000   1.000   0.667   1.000   0.629   1.000
xpa       0.750   0.667   0.750   0.510   0.521   0.625
vhl       0.414   0.431   0.586   0.690   0.192   0.534
wrn       0.667   0.333   0.444   0.333   0.109   0.556
xpc       0.375   0.350   0.500   0.500   0.311   0.500
wfs1      0.727   0.727   0.818   0.727   0.647   0.727
GCDH      0.889   1.000   0.889   0.778   0.814   0.889
tulp1     0.545   0.556   0.667   1.000   0.594   0.550
Avg.      0.549   0.518   0.562   0.589   0.395   0.637



Table 4. Average relative rank, i.e. average rank of relevant documents in current
query from July 2004 normalized by the number of documents returned. Avg., mean
of Avg.rel.Rank over all queries.

          LR      SR      NBR     ORR     LR_M
wt1       0.410   0.520   0.283   0.497   0.288
ump s.    0.026   0.007   0.027   0.075   0.026
xpa       0.031   0.018   0.023   0.912   0.053
vhl       0.319   0.312   0.225   0.234   0.213
wrn       0.098   0.220   0.126   0.474   0.113
xpc       0.101   0.186   0.070   0.460   0.054
wfs1      0.268   0.316   0.185   0.428   0.252
GCDH      0.142   0.091   0.228   0.707   0.166
tulp1     0.619   0.375   0.167   0.042   0.524
Avg.      0.224   0.227   0.148   0.425   0.188



The excellent result of ORR inspired us to try another experiment, shown in
the last column of the results tables as LR_M. Here, we simply added the single
term ‘missense’ (= the word learned by ORR) to the queries and reran LR. This
results in improvements for most queries and makes LR_M the best ranker both
by mean average precision and mean PRBE.

Table 5. Performance comparison between LuceneRanker (LR) and LuceneIndexRanker
(LIR). Avg.Pr., Average precision; PRBE, Precision-Recall Breakeven point;
Avg.rel.Rank, average rank of relevant documents in current query from July 2004
normalized by the number of documents returned.

          Avg.Pr.           PRBE              Avg.rel.Rank
          LR      LIR       LR      LIR       LR      LIR
wt1       0.199   0.236     0.118   0.176     0.410   0.344
ump s.    1.000   1.000     1.000   1.000     0.026   0.026
xpa       0.500   0.143     0.750   0.571     0.031   0.139
vhl       0.449   0.463     0.414   0.379     0.319   0.311
wrn       0.698   0.254     0.667   0.333     0.098   0.228
xpc       0.292   0.333     0.375   0.417     0.101   0.083
wfs1      0.874   0.856     0.727   0.727     0.268   0.284
GCDH      0.977   1.000     0.889   1.000     0.142   0.120
tulp1     0.091   0.083     0.545   0.542     0.619   0.667
Avg.      0.564   0.485     0.609   0.572     0.224   0.245

The results are intriguing in more than one sense. For one, LR_M and ORR
perform comparably to [3], who reported 58.89% precision and 69.28% recall 17,
even though we did not use stemming, lexical normalization, or biological
background knowledge; used only 67% training data instead of the 80% implicit
in their five-fold cross-validation; and although they did not remove
overlapping documents between training and test queries, which may have led to
an overestimation of precision 18.

Our results agree well with common wisdom within text mining that ranking
approaches with simple word vector representations are competitive with much
more elaborate approaches 19. What is even more intriguing is that the word
‘missense’ which is so useful in ranking relevant documents is not even mentioned
in [3], although some multi-word phrases containing this word are mentioned.

6.2 Risk of Local Search 

To determine the risk of our local search approach, we compared LuceneRanker
(LR) to LuceneIndexRanker (LIR). The only difference between the two rankers is
that while LR creates a local index of all documents within the current query,
LIR adds all documents within the current query to a one-year snapshot of
MEDLINE obtained via TREC 20, consisting of half a million MEDLINE references
indexed between 1st April 2002 and 2003. We consider this to be a reasonable
approximation to using the full MEDLINE database of twelve million entries.
Table 5 shows the results. As can be seen, LIR performs broadly similarly to
LR, and on average slightly worse. Clearly, LIR does not perform much better,
as might have been expected from the fact that term and document frequencies
are better estimated in the larger index. It seems that small, query-dependent
full-text search may also have its advantages.

In conclusion, the risk of our local search approach seems to be marginal. It
seems that having a local MEDLINE installation is not essential.

17 Compare Table 3 - PRBE gives a value of e.g. precision=recall=58.9% for ORR and
63.7% for LR_M.

18 In unpublished experiments on this very dataset, not removing overlapping
documents led to an overestimation in average precision by 0.1 (!)

19 Personal communication, Walter Daelemans.

20 trec.nist.gov. This is the snapshot that was used for the TREC 2003 Genomics
track and was later made publicly available.

Table 6. Predicted synonyms, separated by comma. These were removed from each
query for the homonymy recognition experiments.

Query     Homonyms
vhl       HRCA1, RCA1
xpc       P125
wrn       RECQL2, RECQL3
tulp1     RP14
wt1       WAGR

6.3 Homonymy Recognition 

Lastly, we investigated whether removal of homonyms from the expanded queries 
improves the ranking. Based on the reasonable assumption that each synonym 
group concerns a single protein/gene entity, we consider homonyms to be names 
which appear in more than one synonym group. Five 21 of our nine test queries 
had at least one term which was present in more than one group, see Table 6. 
We removed these terms from the queries and reran LR plus SR. ORR and NBR 
did not show any changes, since exactly the same set of documents was returned 
for each query. 

Results indicate that the improvement is marginal at best and slightly neg- 
ative at worst. Overall the performance is almost indistinguishable. Generally, 
3.9% of synonyms for species Homo sapiens within our database are homonyms 
according to our approach which roughly agrees with the proportion of search 
terms removed for our test queries. Thus, the practical consequences of the 
homonymy problem seem to be negligible in our case. 
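
The homonym criterion used here (a name that appears in more than one synonym group) can be sketched as follows. This is a hedged illustration, not the BioMinT implementation: the function names are hypothetical, the synonym resource is assumed to be a mapping from an entity identifier to its list of names, and case-insensitive matching is an assumption. The entity and name strings in the usage example are made up.

```python
from collections import defaultdict


def find_homonyms(synonym_groups):
    """Return the names (lowercased) that occur in more than one synonym group."""
    groups_by_name = defaultdict(set)
    for entity, names in synonym_groups.items():
        for name in names:
            groups_by_name[name.lower()].add(entity)
    return {name for name, entities in groups_by_name.items()
            if len(entities) > 1}


def remove_homonyms(query_terms, homonyms):
    """Drop every query term that was recognized as a homonym."""
    return [term for term in query_terms if term.lower() not in homonyms]


# Made-up synonym groups for illustration only.
groups = {
    "entity_1": ["NAME1", "ALIAS-A", "SHARED"],
    "entity_2": ["NAME2", "shared"],
}
homonyms = find_homonyms(groups)
cleaned = remove_homonyms(["NAME1", "ALIAS-A", "SHARED"], homonyms)
```

The cleaned term list would then be turned into a query and ranked by LR and SR as before.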



21 Initially we additionally found two homonyms for wfs1, but feedback from domain
experts enabled us to trace the wrong homonym to an erroneous entry in an imported
online database, which has since then been corrected. All entries shown here are
verified homonyms.

Table 7. Left: Average precision. Avg., mean average precision. Right:
Precision-Recall break even point. Avg., average PRBE. LRh, SRh are ranking
results where homonyms were removed from the query.

          Average precision                 PRBE
          LR      LRh     SR      SRh       LR      LRh     SR      SRh
wt1       0.199   0.210   0.164   0.169     0.118   0.118   0.118   0.118
vhl       0.449   0.442   0.407   0.416     0.414   0.414   0.431   0.431
wrn       0.698   0.649   0.438   0.438     0.667   0.556   0.333   0.333
xpc       0.292   0.292   0.171   0.171     0.375   0.375   0.350   0.350
tulp1     0.091   0.091   0.111   0.111     0.545   0.545   0.556   0.556
Avg.      0.434   0.420   0.363   0.365     0.474   0.456   0.419   0.419

Table 8. Average rank within recent query. Avg., mean of average relative rank.
LRh, SRh are ranking results where homonyms were removed from the query.

          LR      LRh     SR      SRh
wt1       0.410   0.390   0.520   0.510
vhl       0.319   0.325   0.312   0.313
wrn       0.098   0.120   0.220   0.224
xpc       0.101   0.116   0.186   0.187
tulp1     0.619   0.619   0.375   0.375
Avg.      0.302   0.310   0.322   0.322


7 Related Research

[3] report on a refined approach to predict relevance from the same medical
annotation dataset. They use normalisation of gene/protein names, a special
Part-Of-Speech tagger, feature selection and creation of new features based on
journal names, which were input into a Probabilistic Latent Categorizer (PLC).
Their results are comparable to our much less elaborate approach, which follows
the basic machine learning approach towards ranking.

[5] gives a good overview of current approaches in literature data mining, 
also including some approaches to ranking. 

[11] is a very comprehensive approach to named entity recognition of protein
and gene names from biological literature. He also tackles word sense
disambiguation briefly, with good success. Parts of his work have been
integrated into the GeneWays project. A synonym resource somewhat similar to
the one used within BioMinT can be found at http://synonyms.cs.columbia.edu/
and is based on this work. Contrary to our resource, it also incorporates
information extracted from references and full-text papers, and its synonyms
have been extensively reviewed by domain experts.

[7] introduces a system to discriminate papers concerned with protein-protein
interactions from other papers. They used a Bayesian approach and a
log-likelihood scoring function and report promising results. Results are given
in the form of coverage, accuracy and log-likelihood distributions, none of
which can be easily compared to our results.

[10] use boosted Bayesian classifiers and Support Vector Machines to learn
a discrimination model for papers that should be included into a speciality
database. Negative examples were generated by using related documents from
MEDLINE which are not part of the speciality database and thus are assumed to
have been rejected. They report average precision on the top 100 documents of
80% for the best system. However, their task is completely unrelated to medical
annotation and so also cannot be compared.

www.e-biosci.org is a resource that has grown out of another EU research
project. While not specifically addressing document relevance ranking, its goal is
somewhat similar to BioMinT, namely “...a next generation scientific information
platform that will interlink genomic and other factual data with the life sciences
research literature.”

8 Conclusion 

We investigated the relative performance of our rankers on a dataset dealing with
medical annotation. Surprisingly, a quite simple ranker based on the occurrence
of a single word, ORR, was the most successful of the initially considered rankers.
In an extension of our experiments, adding this single significant word to each
search query improved the query-dependent ranker LR, which then outperformed
the simple single-word ranker and yielded results comparable to the
state-of-the-art approach from [3]. This is intriguing insofar as we did not use
lexical preprocessing or biological background knowledge, and as the word we
found is not reported as most significant there, although some multi-word
phrases containing ‘missense’ were reported.

We investigated whether processing the document set returned from querying
PubMed is competitive with using a significant subset of the full MEDLINE
database locally, in a simplified setting. It turned out that this is indeed the
case - if anything, processing the document set from PubMed seems slightly
preferable.

Lastly, we investigated whether the removal of automatically recognized
homonyms from the query improves the ranking. It turns out this is not the
case, so in the context of ranking for medical annotation the homonymy
problem seems negligible.


Abstract. This paper shows an application of Bayesian Programming
to model a simple artificial life problem: that of a worm trying to live
in a world full of poison. Any model of a real phenomenon is incomplete
because there will always exist unknown, hidden variables that influence
the phenomenon. To solve this problem we apply a new formalism,
Bayesian programming, which has previously been used in autonomous
robot programming. The proposed worm model has been used to train a
population of worms using genetic algorithms. We will see the advantages
of our method compared with a classical approach. Finally, we discuss
the emergent behaviour patterns we observed in some of the worms and
conclude by explaining the advantages of the applied method.

Keywords: Bayesian Programming, Artificial Life, Life Formalization 
Model 



1 Introduction 

Artificial Life is a relatively recent discipline whose primary goal was to study the
recreation of biological phenomena using artificial methods.

Nevertheless, applications in this field have quickly exceeded purely biolog- 
ical applications: the methods used can be useful in the study, simulation and 
behaviour prediction of a wide set of complex systems, not only biological. 

The immediate applications of Artificial Life are in the simulation of complex 
processes, chemical synthesis, multivariate phenomena, etc. 

Very complex global behaviour patterns can be observed, initiated by simple 
local behaviour. It is this characteristic (sometimes called emergent behaviour) 
which makes Artificial Life particularly appropriate for the study and simulation 
of complex systems for which detailed analysis, using traditional methods, is
practically non-viable.

Nevertheless, it is necessary to bear in mind that any model of a real
phenomenon will always be incomplete due to the permanent existence of unknown,
hidden variables that will influence the phenomenon. The effect of these
variables is malicious since they will cause the model and the phenomenon to have
different behavioural patterns. In this way both artificial systems and natural
systems have to solve a common problem: how each individual within the system
uses an incomplete model of the environment to perceive, infer, decide and act
in an efficient way.

* This work has been financed by Spanish Comision Interministerial de Ciencia y
Tecnologia (CICYT) project number TIC2001-0245-C02-02 and by the Generalitat
Valenciana project GV04B685

J.A. Lopez et al. (Eds.): KELSI 2004, LNAI 3303, pp. 124-138, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004

A New Artificial Life Formalization Model: A Worm with a Bayesian Brain

Reasoning with incomplete information continues to be a challenge for
artificial systems. Probabilistic inference and learning try to solve this
problem using a formal base. A new formalism, Bayesian programming (BP) [1],
based on the principles of the Bayesian theory of probability, has been
successfully used in autonomous robot programming. Bayesian programming is
proposed as a solution when dealing with problems relating to uncertainty or
incompleteness.

Certain parallelisms exist between this kind of programming and the
structure of living organisms. As shown in a theoretical way in [2], we can
suppose that if a natural process corresponds to each of the steps in a
Bayesian program, then living organisms also use Bayesian inference and
learning. In this way, natural
evolution provided living beings with both the pertinent variables, and the ad- 
equate decomposition and parametric forms. The pertinent variables may have 
been obtained by selecting the sensors and actuators in order to supply vital in- 
formation. The decomposition would correspond to the structure of the nervous 
system, which basically expresses dependencies and conditional independencies 
between variables. The parametric forms can be seen as the information pro- 
cessing units, implemented by neurons and assemblies of neurons. Given this 
apparatus, corresponding to preliminary knowledge, each individual in his life- 
time could answer the first question by experimenting and learning the values of 
the free parameters of his nervous system. 

In this paper we will see a simple example of how to apply BP formalism to 
a specific artificial life problem. We will define a virtual world, divided into cells, 
some of which contain poison. In this world lives a worm with only one purpose, 
to grow indefinitely. In order to grow the worm must move through a certain 
number of non poisonous cells in its world. If the worm moves into a poisonous 
cell then it will die. The worm has a limited vision of the world, provided by 
its sensorial organs, found in its head. These sensors allow the worm to see no 
further than the adjacent cell. 

We believe that this is one of the first approaches that uses Bayesian
programming for the formalization of an artificial life problem, as we have not
found any evidence of its application in this field. The main area where BP has
been and continues to be applied is robotics [2], [1], [3], [4], [5].

This paper is divided into five parts. The first is a short formal introduction
to Bayesian Programming, the second describes the problem and its description
in terms of BP, the third shows the way we create the environment to evolve the
worms, the fourth is about the experimentation and the emergent behaviours
we witness in the worm population and, finally, the fifth draws conclusions and
outlines future lines of investigation to be followed.

2 Bayesian Programming

Before specifying our problem we will introduce the reader to the principles 
and basics of Bayesian programming showing the basic concepts, postulates, 
definitions, notations and rules that are necessary to define a Bayesian Program. 

2.1 Basic Concepts 
Definition and Notation 

Proposition. We will use logical propositions denoted by lowercase names.
Propositions may be composed to obtain new propositions using the usual
logical operators $(\wedge, \vee, \neg)$, denoting conjunction, disjunction
and negation respectively.

Discrete variable. Discrete variables will be denoted by names starting with
one uppercase letter. By definition, a discrete variable $X$ is a set of logical
propositions $x_i$, such that these propositions are mutually exclusive
($\forall i, j,\ i \neq j,\ x_i \wedge x_j = \text{false}$) and exhaustive (at
least one of the propositions $x_i$ is true). $x_i$ stands for "$X$ takes its
$i$th value". $|X|$ denotes the cardinal of the set $X$ (the number of
propositions $x_i$).

The conjunction of two variables $X$ and $Y$, denoted by $X \otimes Y$, is
defined as the set of $|X| \times |Y|$ propositions $x_i \wedge y_j$.
$X \otimes Y$ is a set of mutually exclusive and exhaustive logical
propositions (consequently it is a new variable). Of course, the conjunction of
$n$ variables is also a variable. The disjunction of two variables, defined as
the set of propositions $x_i \vee y_j$, is not a variable because these
propositions are not mutually exclusive.

Probability. To be able to deal with uncertainty, we will attach probabilities
to propositions. We consider that, to assign a probability to a proposition $a$,
it is necessary to have at least some preliminary knowledge, summed up by
a proposition $\pi$. Consequently, the probability of a proposition $a$ is
always conditioned, at least, by $\pi$. For each different $\pi$,
$P(\cdot \mid \pi)$ is an application assigning a unique real value
$P(a \mid \pi)$ in the interval $[0, 1]$ to each proposition $a$.
Of course, we will be interested in reasoning on the probabilities of the
conjunctions, disjunctions and negations of propositions, denoted, respectively,
by $P(a \wedge b \mid \pi)$, $P(a \vee b \mid \pi)$ and $P(\neg a \mid \pi)$.
We will also be interested in the probability of a proposition $a$ conditioned
by previous knowledge $\pi$ and some other proposition $b$. This will be
denoted $P(a \mid b \wedge \pi)$.

For simplicity and clarity, we will also use probabilistic formulas with
variables appearing instead of propositions. Each time a variable $X$ appears
in a probabilistic formula $\Phi(X)$, it should be understood as
$\forall x_i \in X,\ \Phi(x_i)$.



Inference Postulates and Rules. This section presents the inference
postulates and rules used to carry out probabilistic reasoning.

Conjunction and normalization postulates for propositions. In probabilistic
reasoning only two basic rules are defined and used to derive the
rest. These two rules, if we use discrete probabilities, are sufficient to solve
any inference problem.

— Conjunction Rule. Gives the probability of a conjunction of propositions.

$$P(a \wedge b \mid \pi) = P(a \mid \pi) \times P(b \mid a \wedge \pi) = P(b \mid \pi) \times P(a \mid b \wedge \pi) \qquad (1)$$

— Normalization Rule. States that the sum of the probabilities of $a$ and
$\neg a$ is one.

$$P(a \mid \pi) + P(\neg a \mid \pi) = 1 \qquad (2)$$

Using the last two we derive the next rules for propositions and for variables:

— Disjunction rule for propositions.

$$P(a \vee b \mid \pi) = P(a \mid \pi) + P(b \mid \pi) - P(a \wedge b \mid \pi) \qquad (3)$$

— Conjunction rule for variables.

$$P(X \otimes Y \mid \pi) = P(X \mid \pi) \times P(Y \mid X \wedge \pi) = P(Y \mid \pi) \times P(X \mid Y \wedge \pi) \qquad (4)$$

— Normalization rule for variables.

$$\sum_X P(X \mid \pi) = 1 \qquad (5)$$

— Marginalization rule for variables.

$$\sum_X P(X \otimes Y \mid \pi) = P(Y \mid \pi) \qquad (6)$$
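
The rules for variables can be checked numerically on a small example. The sketch below is illustrative only: the joint distribution over two binary variables is made up, the helper names `marginal` and `conditional` are hypothetical, and the preliminary knowledge is left implicit.

```python
# Made-up joint distribution over two binary variables, keyed by (x, y).
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}


def marginal(joint, axis):
    """Marginalization rule: sum the joint over the other variable."""
    out = {}
    for values, p in joint.items():
        out[values[axis]] = out.get(values[axis], 0.0) + p
    return out


def conditional(joint, given_axis, given_value):
    """Conditional of the other variable, obtained from the conjunction rule."""
    m = marginal(joint, given_axis)[given_value]
    other = 1 - given_axis
    return {values[other]: p / m
            for values, p in joint.items()
            if values[given_axis] == given_value}


p_x = marginal(joint, 0)  # distribution of X
p_y = marginal(joint, 1)  # distribution of Y

# Conjunction rule: P(x, y) equals P(x) * P(y | x) for every pair of values.
conjunction_holds = all(
    abs(p_x[x] * conditional(joint, 0, x)[y] - joint[(x, y)]) < 1e-12
    for (x, y) in joint
)
```

Both marginals sum to one (normalization), and `conjunction_holds` evaluates to True for this joint.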

2.2 Bayesian Program Definition 

A Bayesian program is defined as a means of specifying a family of probability
distributions. The constituent elements of a BP are presented in Fig. 1.

Program
    Description
        Preliminary knowledge ($\pi$)
            Pertinent variables
            Decomposition
            Forms: parametric forms or programs
        Identification based on data ($\delta$)
    Question

Fig. 1. Structure of a Bayesian program




128 Fidel Aznar Gregori et al. 



Description. The purpose of a description is to specify an effective method of
computing a joint distribution on a set of variables
$\{X^1, X^2, \ldots, X^n\}$ given a set of experimental data $\delta$ and
preliminary knowledge $\pi$. This joint distribution is denoted as
$P(X^1 \otimes X^2 \otimes \ldots \otimes X^n \mid \delta \otimes \pi)$.



Preliminary Knowledge. To specify preliminary knowledge the programmer 
must undertake the following: 

— Define the set of relevant variables {X^X 2 , ..., X”} on which the joint dis- 
tribution is defined. 

— Decompose the joint distribution. Given a partition of {X 1 , X 2 , ..., X n } into 
k subsets we define k variables L 1 , ..., L k each corresponding to one of these 
subsets. 

Each variable L l is obtained as the conjunction of the variables { X 11 , X 12 , . . . } 
belonging to the subset i. This way the conjunction rules leads to: 

P(X 1 ® X 2 ® ... ® X n \5 0 7 r) = P(L 1 |(5 ® 7r) x P(L 2 \L 1 ® 5 0 7r) x ...x 

x P(L fc |L fc_1 ® L k ~ 2 ® ... ® L 1 ® 6 ® tt) 

(7) 

Conditional independence hypotheses then allow further simplifications. A 
conditional independence hypothesis for variable L l is defined by picking 
some variables X* among the variables appearing in conjunction P* _1 ® 
L l ~ 2 <8 ••• <8 L 1 , calling R l the conjunction of these chosen variables and 
setting: 

P(L i |P i_1 ® L i ~ 2 ® ... ® L 1 ® 6 0 7r) = P(L i |P i ® S 0 7r) (8) 

We obtain: 

P(X 1 ® X 2 ® ... ® X n \5 0 7r) = P(P 1 |P 1 0 <5 0 7r)x 

xP(L 2 \R 2 ® 6 0 7r) x ... x P(L fc |P fc ® 6 0 7r) ^ 

Such a simplification of the joint distribution as a product of simpler distri- 
butions is called a decomposition. 

— Define the form. Each distribution $P(L^i \mid R^i \otimes \delta \otimes \pi)$ appearing in the product
(9) is then associated with either a parametric form (i.e., a function $f_\mu(L^i)$)
or another Bayesian program. In general $\mu$ is a vector of parameters that
may depend on $R^i$ or $\delta$ or both. Learning takes place when some of these
parameters are computed using the data set $\delta$.

Questions. Given a description (i.e., $P(X^1 \otimes X^2 \otimes \ldots \otimes X^n \mid \delta \otimes \pi)$), a question is
obtained by partitioning $\{X^1, X^2, \ldots, X^n\}$ into three sets: the searched variables,
the known variables and the unknown variables.

The variables Searched, Known and Unknown are defined as the conjunction
of the variables belonging to these sets. We define a question as the distribution:

$$P(Searched \mid Known \otimes \delta \otimes \pi) \qquad (10)$$
2.3 Running a Bayesian Program

Running a Bayesian program supposes two basic capabilities: Bayesian inference 
and decision-making. 

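Bayesian Inference. Given the joint distribution $P(X^1 \otimes X^2 \otimes \ldots \otimes X^n \mid \delta \otimes \pi)$,
it is always possible to compute any possible question, using the following general
inference:

$$
\begin{aligned}
P(Searched \mid Known \otimes \delta \otimes \pi)
&= \sum_{Unknown} P(Searched \otimes Unknown \mid Known \otimes \delta \otimes \pi) \\
&= \frac{\sum_{Unknown} P(Searched \otimes Unknown \otimes Known \mid \delta \otimes \pi)}{P(Known \mid \delta \otimes \pi)} \\
&= \frac{\sum_{Unknown} P(Searched \otimes Unknown \otimes Known \mid \delta \otimes \pi)}{\sum_{Unknown} \sum_{Searched} P(Searched \otimes Unknown \otimes Known \mid \delta \otimes \pi)} \\
&= \frac{1}{\Sigma} \times \sum_{Unknown} P(Searched \otimes Unknown \otimes Known \mid \delta \otimes \pi)
\end{aligned}
\qquad (11)
$$

In the third line of (11) the denominator appears to be a normalization
term. Consequently, by convention it is replaced by $\Sigma$.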
General Bayesian inference is a very difficult problem: exact inference has
been proved to be NP-hard, and so has the general problem of approximate
inference. In this paper we assume all inference problems to be solved and
implemented using an efficient inference machine.



Decision-Making. For a given distribution, different decision policies are
possible. We can search for the best (highest probability) values or we can draw at
random according to the distribution. We will use the second policy, calling it
$Draw(P(Searched \mid Known \otimes \delta \otimes \pi))$.
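
The general inference above amounts to summing the joint distribution over the Unknown variables and renormalizing. The following Python sketch illustrates this on a toy joint table, together with the Draw policy. It is only an illustration under stated assumptions: the function names `question` and `draw`, the variables A, B, C and the toy probabilities are hypothetical and do not come from the paper, and brute-force enumeration of this kind is only feasible for very small problems.

```python
import itertools
import random


def question(joint, names, searched, known):
    """Return P(searched | known) from a full joint table.

    joint    : dict mapping a tuple of values (ordered as in names) to a probability
    names    : tuple of variable names
    searched : list of names whose distribution is wanted
    known    : dict name -> observed value
    Every other variable is treated as Unknown and summed out.
    """
    s_idx = [names.index(n) for n in searched]
    scores = {}
    for values, p in joint.items():
        # keep only the entries compatible with the known values
        if any(values[names.index(n)] != v for n, v in known.items()):
            continue
        key = tuple(values[i] for i in s_idx)
        scores[key] = scores.get(key, 0.0) + p  # sum over Unknown
    z = sum(scores.values())  # normalization term
    return {key: p / z for key, p in scores.items()}


def draw(dist, rng=random):
    """Draw one value at random according to the distribution."""
    keys = list(dist)
    weights = [dist[k] for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


# Toy joint distribution over three binary variables (hypothetical values).
names = ("A", "B", "C")
raw = {v: 1.0 + v[0] + 2 * v[1] * v[2] for v in itertools.product((0, 1), repeat=3)}
total = sum(raw.values())
joint = {v: p / total for v, p in raw.items()}

posterior = question(joint, names, searched=["A"], known={"C": 1})
sample = draw(posterior)
```

Here A is searched, C is known and B is unknown, so B is summed out before the result is normalized.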

3 Specifying the Problem Using Bayesian Programming 

We commented above on the existence of a world, composed of n × m cells,
where each cell $C_{ij}$ can be in any one of four different states: empty, containing
poison, part of the wall which surrounds this artificial world, or hidden
from view beneath the worm’s tail. In this way a cell $C_{ij} \in \{0, P, M, L\}$. The
wall configuration is the same for each generated world; in contrast, the
distribution of poison is random and varies from world to world. Initially, we
assume the number of poisonous cells to be between 5% and 10% of the total.

Within each world lives a single worm whose only objective is to move
and to grow. A worm grows, increasing its length by one unit, every time it
moves through d cells inside its world. If the worm moves to a cell that is not
empty then it dies. The only information about the world available to the
worm is provided by its sensors, located in its head (see figure 2). A sensor is
only able to see the state of cells adjacent to and in front of the worm’s head,
no further.
























Fig. 2. Worm’s vision relating to its head and the direction it has in the world. 



We assume that each worm has a certain knowledge represented as states.
At each instant a worm is in one state $E_t$, which depends on a reading and the
previous state. Furthermore, a worm can obtain a reading of the world $L_t$,
represented as a binary triplet in which each component specifies whether the cell
in the corresponding position is occupied (’1’) or not (’0’). Finally, a worm can
execute three actions: go straight ahead, turn left or turn right, $A_t \in \{u, l, r\}$.
The actions are guided only by a reading and the current state of the worm. Once
the action $A_t$ has been executed the worm can change to a new state $E_{t+1}$.



3.1 Variables Description 

The first part of a Bayesian program is to define the pertinent variables of the 
problem. 

To make a movement in the world, the worm only needs to know the
reading $L_t$ of its sensor and the current state $E_t$, in addition to the set of actions
A it can perform in the world. As we commented previously, an action $A_t$ must
be followed by an immediate change of state at instant t + 1.

In this way we define the following variables for each instant t:

$$
\begin{aligned}
L_t &\in \{000, 001, 010, \ldots, 111\}, & |L_t| &= 8 \\
E_t &\in \{0, 1, 2, \ldots, k\}, & |E_t| &= k + 1 \\
A_t &\in \{u, l, r\}, & |A_t| &= 3
\end{aligned}
\qquad (12)
$$



3.2 Decomposition 

We define a decomposition of the joint probability distribution
$P(L_t \otimes E_{t-1} \otimes E_t \otimes A_t \mid \pi_W)$ as a product of simpler terms. This distribution is
conditioned by the previous knowledge $\pi_W$ we are defining.

$$
\begin{aligned}
P(L_t \otimes E_{t-1} \otimes E_t \otimes A_t \mid \pi_W)
= {} & P(L_t \mid \pi_W) \times P(E_{t-1} \mid L_t \otimes \pi_W) \\
& \times P(E_t \mid E_{t-1} \otimes L_t \otimes \pi_W) \times P(A_t \mid E_t \otimes E_{t-1} \otimes L_t \otimes \pi_W) \\
= {} & P(L_t \mid \pi_W) \times P(E_{t-1} \mid L_t \otimes \pi_W) \\
& \times P(E_t \mid E_{t-1} \otimes L_t \otimes \pi_W) \times P(A_t \mid E_t \otimes L_t \otimes \pi_W)
\end{aligned}
\qquad (13)
$$

The second equality is deduced from the fact that an action only depends on
the current state and the reading taken.



3.3 Parametrical Forms 

In order to be able to solve the joint distribution we need to assign parametrical
forms to each term appearing in the decomposition:

$$
\begin{aligned}
P(L_t \mid \pi_W) &= \text{Uniform} \\
P(E_{t-1} \mid L_t \otimes \pi_W) &= \text{Uniform} \\
P(E_t \mid E_{t-1} \otimes L_t \otimes \pi_W) &= G(\mu(E_{t-1}, L_t), \sigma(E_{t-1}, L_t)) \\
P(A_t \mid E_t \otimes L_t \otimes \pi_W) &= G(\mu(E_t, L_t), \sigma(E_t, L_t))
\end{aligned}
\qquad (14)
$$

We assume that the probability of a reading is uniform because we have
no prior information about the distribution of the world. In the same way we
consider that all possible worm states can be reached with the same probability.

Given a state $E_{t-1}$ and a reading $L_t$ we believe that only one state $E_t$ would
be preferred. In this way the distribution $P(E_t \mid E_{t-1} \otimes L_t \otimes \pi_W)$ is unimodal.
However, depending on the situation, the decision to be made may be more or
less certain. This behaviour is captured by assigning a Gaussian parametrical
form to $P(E_t \mid E_{t-1} \otimes L_t \otimes \pi_W)$.

In the same way, given a state and a reading we suppose that one action
would be preferred with more or less intensity. We assign a Gaussian parametrical
form to $P(A_t \mid E_t \otimes L_t \otimes \pi_W)$.

3.4 Identification 

The model leaves a set of free parameters which define the way the worm moves.
These free parameters, derived from the parametrical forms (the means and standard
deviations of all the $[E_{t-1}] \times [L_t]$ and $[E_t] \times [L_t]$ Gaussians), are the ones
to be learned.



3.5 Utilization 

The movement of a worm involves the following steps:

— Obtain a reading $L_t$ from the worm’s sensors.

— Answer the question $Draw(P(A_t \mid E_t \otimes L_t \otimes \pi_W))$.

— Execute the movement command $A_t$.

— Answer the question $Draw(P(E_{t+1} \mid E_t \otimes L_t \otimes \pi_W))$.

— Change to the state $E_{t+1}$.
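
The steps above can be sketched in Python as follows. This is a minimal illustration, not the authors’ simulator: the way a Gaussian is discretized over a finite set of values, the integer encoding of actions, the dictionary layout of the two parameter tables and the names `discrete_gaussian` and `step` are all assumptions made for this example.

```python
import math
import random

ACTIONS = ("u", "l", "r")  # go straight ahead, turn left, turn right
READINGS = [format(i, "03b") for i in range(8)]  # "000" ... "111"


def discrete_gaussian(mu, sigma, n):
    """Probabilities over 0..n-1 proportional to a Gaussian density.

    Assumed discretization; sigma == 0 puts all the mass on round(mu),
    which is expected to lie in 0..n-1.
    """
    if sigma == 0:
        return [1.0 if i == round(mu) else 0.0 for i in range(n)]
    weights = [math.exp(-((i - mu) ** 2) / (2 * sigma ** 2)) for i in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def step(state, reading, action_table, state_table, n_states, rng=random):
    """One movement: draw an action, then draw the next state."""
    mu_a, sigma_a = action_table[(state, reading)]
    p_action = discrete_gaussian(mu_a, sigma_a, len(ACTIONS))
    action = rng.choices(ACTIONS, weights=p_action)[0]

    mu_e, sigma_e = state_table[(state, reading)]
    p_state = discrete_gaussian(mu_e, sigma_e, n_states)
    next_state = rng.choices(range(n_states), weights=p_state)[0]
    return action, next_state


# Usage with hypothetical, hand-written parameters for a two-state worm.
n_states = 2
action_table = {(e, l): (1.0, 0.0) for e in range(n_states) for l in READINGS}
state_table = {(e, l): (1.0 - e, 0.0) for e in range(n_states) for l in READINGS}
action, next_state = step(0, "010", action_table, state_table, n_states)
```

With zero standard deviations the behaviour is deterministic; larger values make the worm’s choices progressively more random.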







4 Genetic Algorithms 

Genetic algorithms (GA) are a global search technique that mimics aspects of
biological evolution, namely the process of natural selection and the principle 
of survival of the fittest. They use an adaptive search procedure based on a 
population of candidate solutions or chromosomes. Each iteration or generation 
involves a competitive selection procedure that favours fitter solutions and re- 
jects poorer solutions. The successful candidates are then recombined with other 
solutions by swapping components with one another; they can also be mutated 
by making a small change to a single component. The procedure is repeated for 
many generations, producing new solutions that are biased towards regions of 
the search space in which good solutions have already been found. 

We initially assume that the worm’s parameters are generated randomly. The
worm only has the previous knowledge provided by its knowledge decomposition.
Learning then takes place generation after generation: the worms that live
longest in the world are the ones most likely to reproduce and so to pass on
their intelligence.

4.1 Chromosome Codification 

A chromosome is represented using two tables. The first one is formed by 2 · k · 8
components specifying the Gaussians $[E_{t-1}] \times [L_t]$ which represent
$P(E_t \mid E_{t-1} \otimes L_t \otimes \pi_W)$.

The second table is formed by the same number of components, specifying the
Gaussians $[E_t] \times [L_t]$ which represent $P(A_t \mid E_t \otimes L_t \otimes \pi_W)$. In this way, each
chromosome contains 32 · k genes.

In the described experiments, the initial chromosome population is obtained
by randomly initializing the Gaussian parameters in the following range:

$$G(\mu, \sigma): \quad \mu = [\ldots k], \quad \sigma = [0, 0.25, 0.50, 0.75, 1]$$

for the probability $P(E_t \mid E_{t-1} \otimes L_t \otimes \pi_W)$ and

$$G(\mu, \sigma): \quad \mu = [\ldots], \quad \sigma = [0, 0.25, 0.50]$$

for $P(A_t \mid E_t \otimes L_t \otimes \pi_W)$.
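
The chromosome layout can be sketched as follows. This is a hedged illustration only: the standard deviation sets are the ones listed above, but the range used for the means (a valid state index, or one of three action indices), the list-of-pairs layout and the names `random_chromosome` and `gene_count` are assumptions made for this example.

```python
import random

READINGS = 8                                  # number of possible sensor readings
STATE_SIGMAS = (0.0, 0.25, 0.50, 0.75, 1.0)   # sigma values for the state table
ACTION_SIGMAS = (0.0, 0.25, 0.50)             # sigma values for the action table
N_ACTIONS = 3


def random_chromosome(k, rng=random):
    """Two tables, each with one (mu, sigma) pair per (state, reading) cell."""
    state_table = [
        (rng.randrange(k), rng.choice(STATE_SIGMAS))      # assumed mu range: 0..k-1
        for _ in range(k * READINGS)
    ]
    action_table = [
        (rng.randrange(N_ACTIONS), rng.choice(ACTION_SIGMAS))  # assumed mu range: 0..2
        for _ in range(k * READINGS)
    ]
    return state_table, action_table


def gene_count(chromosome):
    """Each table cell contributes two genes, a mean and a standard deviation."""
    return sum(2 * len(table) for table in chromosome)


chromosome = random_chromosome(k=5, rng=random.Random(0))
genes = gene_count(chromosome)  # 2 tables x (2 * 5 * 8) components = 32 * 5
```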

4.2 Fitness Function 

The evaluation function contains specific knowledge that is used to assess the 
quality of the solutions. 

In our case we want to reward the worms that live longest in the world. We
therefore define the fitness function as the number of iterations that a worm
survives in a randomly generated world. To avoid the situation where an easy
world produces an overvalued worm, we generate w random worlds to evaluate
each worm’s fitness.

As we commented previously, each world is composed of cells (empty or full
of poison) contained within a wall. All worlds are the same size and have the
same wall layout; only the quantity and position of poisonous cells vary,
being selected randomly and comprising between 5% and 10% of the total cells.
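
A fitness evaluation along these lines might look like the sketch below. Several points are assumptions rather than details given in the text: the world is stored as a list of rows, the scores over the w worlds are combined by averaging, and `lifetime` is a hypothetical callback standing in for the simulator that counts how many iterations a worm survives in a given world.

```python
import random


def make_world(rows, cols, rng, low=0.05, high=0.10):
    """Walled grid in which a random 5%-10% of all cells contain poison."""
    world = [
        ["M" if r in (0, rows - 1) or c in (0, cols - 1) else "0" for c in range(cols)]
        for r in range(rows)
    ]
    interior = [(r, c) for r in range(1, rows - 1) for c in range(1, cols - 1)]
    n_poison = round(rng.uniform(low, high) * rows * cols)
    for r, c in rng.sample(interior, n_poison):
        world[r][c] = "P"
    return world


def fitness(worm, lifetime, w, rows, cols, rng):
    """Mean number of iterations survived over w random worlds (assumed aggregation)."""
    return sum(lifetime(worm, make_world(rows, cols, rng)) for _ in range(w)) / w


# Usage with a stand-in simulator that simply counts the poison cells.
def count_poison(worm, world):
    return sum(row.count("P") for row in world)


rng = random.Random(0)
example_world = make_world(20, 20, rng)
example_score = fitness(None, count_poison, 6, 20, 20, rng)
```

Evaluating every worm on the same number of freshly generated worlds keeps a single lucky layout from dominating the ranking.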

4.3 Selection, Crossover and Mutation Operators

The selection phase in a genetic algorithm involves creating a mating pool by se- 
lecting individual solutions that are fitter with a higher probability. The selected 
individuals in the mating pool are then combined to create a new population 
using the crossover operator, with occasional small random changes due to the 
mutation operator. Below we describe the operators that gave the best results
in our experiments.

Selection operator. We used a stochastic remainder sampling selector (SRS) 
with a two-staged selection procedure. In the first stage, each individual’s 
expected representation is calculated. A temporary population is filled using 
the individuals with the highest expected numbers. Any fractional expected 
representations are used to give the individual more likelihood of filling a 
space. The second stage of selection is uniform random selection from the 
temporary population. In addition we use elitism (the best individual from 
each generation is carried over to the next generation). 

Crossover operator. We use an asexual two-point crossover operator. In this 
way the mother genes will be selected until the crossover point where the 
father genes will be copied. This process will be done for the two tables (see 
figure 3) that describe the chromosome. 

Mutation operator. We define an incremental mutation operator for states:
given a gene $x \in [0, k]$, we define the mutation as $mut(x) = (x + 1) \bmod k$.
Suppose we have four states and that $g_2 = 3$; if we mutate this element we
obtain $g_2 = (3 + 1) \bmod 4 = 0$. A random mutation scheme is used to choose
the directions for the worm to take: a new direction is generated randomly and
then substitutes the original gene.
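
The three operators can be sketched as follows. This is one possible reading of the descriptions above, not the authors’ implementation: the function names are hypothetical, the fractional parts in the selection stage are used as sampling weights for the remaining places, the crossover uses one cut point in each of the two tables, and fitness values are assumed to be positive.

```python
import random


def srs_pool(population, fitnesses, rng=random):
    """Stage one of stochastic remainder sampling: build the temporary population."""
    n = len(population)
    mean = sum(fitnesses) / n
    expected = [f / mean for f in fitnesses]      # expected number of copies
    pool = []
    for individual, e in zip(population, expected):
        pool.extend([individual] * int(e))        # integer part: guaranteed copies
    fractions = [e - int(e) for e in expected]
    missing = n - len(pool)
    if missing > 0:
        # fractional parts make an individual more likely to fill a free place
        pool.extend(rng.choices(population, weights=fractions, k=missing))
    return pool


def select(pool, rng=random):
    """Stage two: uniform random selection from the temporary population."""
    return rng.choice(pool)


def crossover(mother, father, rng=random):
    """Copy the mother's genes up to a cut point, then the father's, per table."""
    child = []
    for m_table, f_table in zip(mother, father):
        cut = rng.randrange(len(m_table) + 1)
        child.append(m_table[:cut] + f_table[cut:])
    return tuple(child)


def mutate_state(x, k):
    """Incremental mutation for a state gene."""
    return (x + 1) % k


def mutate_direction(rng=random, n_actions=3):
    """Random mutation for a direction gene: replace it with a new direction."""
    return rng.randrange(n_actions)


rng = random.Random(1)
pool = srs_pool(["a", "b", "c"], [3.0, 1.0, 2.0], rng)
child = crossover((["m"] * 4, ["m"] * 4), (["f"] * 4, ["f"] * 4), rng)
```

Elitism, as described for the selection operator, would additionally copy the best individual of each generation into the next one unchanged.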





Fig. 3. Asexual two point crossover operator for the worm’s chromosome.
Fig. 4. Maximum evaluated individual for each number of states. The y axis represents
the worm’s fitness and the x axis the number of states used (k). We use 100 executions
for each state with a population of 250 individuals and 500 generations, using the
operators specified in the previous section.



4.4 Used Parameters 

Using the operators presented in the previous section we obtain an evolutionary
learning process for a worm. From empirical tests we conclude that a number of
states (k) greater than five complicates the learning process of the worm and
does not improve the movements made by the worm. For this reason, in the rest
of the experiments, we use a fixed number of states equal to five (see figure 4).

In addition, for the remainder of the tests we use d = 5 (for every five cells the
worm moves through, it increases its size by one unit) and w = 6 (six random
worlds are generated in order to evaluate each worm).



4.5 Worms Evolution 

In order to obtain the best individual we use 100 executions for each state with a
population of 250 individuals and 500 generations using the operators specified
in the previous section. Figure 5 shows an example of one execution, with the
fitness of the worst and best performing individuals as well as the average,
illustrating the convergence of the algorithm. For each algorithm execution the
evaluation took about 2 minutes using a Pentium IV running at 2 GHz.
Fig. 5. Evolution of the worst (bottom), the average (middle) and the best performing
individual (top). The y axis represents the worm’s fitness and the x axis the current
generation. During the first 50 iterations the average and the worst individuals
improve. Then the graph tends to oscillate, although a slight increase is produced
(because the best case increases and maintains its level through elitism).



5 Survival Behaviours and Experimentation 

In this section we will analyze some characteristics and emergent behaviours that 
were observed in the worms. Readers of this paper are invited to test our simu- 
lator at the following address http://www.dccia.ua.es/~fidel/woriii.zip. 

Using Bayesian Programming the worm’s previous knowledge is defined and 
mechanisms are given to provide new knowledge to the worm. This data is rep- 
resented using two sets of discrete Gaussians which were learned using genetic 
algorithms. However, we should remember that to get the information of the 
learned distributions we use the Draw function which randomly extracts a value 
for the distribution. In this way we obtain a non-deterministic behaviour, which 
is more adaptable to variations in complex worlds. 

After training the worm population we simulate, in a graphical way, the best 
individual found. It is curious to see different behaviour patterns, which provide 
more survival opportunities. Some of these patterns even seem to imitate natural 
behaviour developed in some animals. 

One of the most common patterns is to follow the edge of the world while
no poison is found near it (see figure 6a). This is a good way to move if the
proportion of poison is low near the edges and no configurations exist that
trap the worm between the perimeter and the poison. Another curious behaviour
is the development of a zigzag movement emulating the way some snakes move
(see figure 6b), so reducing the area that the worm occupies in the world. In
addition it is quite common for the worm to move up and down like a ping-pong
ball (see figure 6c). Finally, we underline the movement of some worms which
seem to move as if trying to reach their tails, so forming a spiral (see figure 6d).

Fig. 6. Different behaviour patterns: a) Follow the edge of the world. b) Zigzag
movement. c) Ping-pong behaviour. d) The worm seems to follow its tail.
The arrow points to the next worm displacement.

The behaviours described above (and some others) are repeated and combined
in the worms obtained. These behaviours are not explicitly programmed; they
have been obtained using the proposed brain model and selected by an evolutionary
process in a population.

5.1 Comparing Our Model with a Classical One

We can see some advantages of our method if we compare it to a more classical
modelling approach, the finite state machine (FSM) [6]. This approach has
various drawbacks. First, it assumes a perfect world model, which is false. Any
model of a real phenomenon is incomplete, because there will always exist hidden
variables, not considered in the model, that influence the phenomenon. The
effect of these variables is harmful, since they cause the model and the
phenomenon to show different behavioural patterns.

Second, an FSM exhibits deterministic behaviour, and therefore in certain world
configurations it will fail. A Bayesian model, on the other hand, does not have
deterministic behaviour: two different executions in the same world may have
different results, which provides greater adaptability to changes in the
environment configuration.

6 Conclusions 

In this paper we have seen an application of Bayesian programming in an artificial
life system. The formalization of artificial life models is a continuing field of
investigation because of the complexity of the systems we work with [7], [8]. In
addition, we have the added difficulty of working with uncertainty and including it
in the model we want to use. Bayesian programming provides a formalism
in which uncertainty is handled implicitly, using probabilities.

In a world with randomly distributed poison lives a worm, whose main purpose
is to grow. We propose a formalization of the virtual worm using a
decomposition of its knowledge in terms of a joint probability distribution. In
this way, by applying Bayesian programming we obtain a versatile behaviour that
is adaptable to changes and, what is more, a mathematical description of the
probabilistic environment model.

We have seen some advantages of our method compared to a more classical
modelling approach, the finite state machine (FSM [6]) (see section 5.1).

The learning process, given a worm population, has been developed with
evolutionary techniques, using genetic algorithms. The principal reason for using
GA was that they are a global search technique that mimics aspects of
biological evolution, even though other search techniques could be chosen to select
the worms. Each chromosome used is the codification of the two distributions
obtained with the previous Bayesian formalism (see figure 3).

Satisfactory results were obtained that prove the validity of the proposed
model. Relatively complex and elaborate behavioural patterns were observed in
the movements of the most highly adapted worms. These behaviour patterns
were not explicitly programmed but were obtained in an emergent way using
the proposed model.

Bayesian programming is, therefore, a promising way to formalize both artifi- 
cial and natural system models. In this example, we have seen how this paradigm 
can be adapted to a simple, artificial life problem. Future studies will try to model 
different artificial life systems using this new formalism. 
Abstract. Humanoid robotics requires new programming tools. Programming
by demonstration is good for simple movements, but so far the adaptation for 
fine movements in grasping is too difficult for it. Grasping of natural objects 
with a natural hand is known as one of the most difficult problems in robotics. 
Mathematical models have been developed only for simple hands or for simple 
objects. In our research we try to use data directly obtained from a human 
teacher as in imitation learning. To get data from users we built a data glove, we 
collected data from different experiments, and generalized them through neural 
networks. Here we discuss the nature of the data collected and their analysis. 



1 Introduction 

The haptic sense is very important for human beings, especially during activities like
manipulation. Sometimes it is possible to do a task without visual feedback, using only
tactile and force sensations. This is the reason why future Virtual Reality (VR)
systems should be improved by devices capable of acquiring somatosensory data (like
articulation positions and velocities) and able to evoke touch and force feelings.

On the other hand, the control of a robotic hand can be much easier if the complex
data about positions and forces are learned from a human teacher and not derived
from geometric and dynamic equations. An expert performs the grasp, and phalanx
positions and fingertip forces are acquired only when the object is firmly gripped.
The data are then used by a Neural Network to learn how to generate the positions and
forces needed to grasp objects, with some generalization. This is useful, for example,
to control an artificial hand without solving the inverse kinematic and dynamic problem.

The challenge of our investigation is the possibility to teach grasping to a human- 
oid hand after learning from human grasping. To obtain data from the human teacher 
we designed a special glove, as we will illustrate in the following. 

Grasping in humans is a complex activity and takes place in two steps: planning, 
which requires encephalon activity, and executing, which requires activity of the 
neuro-motor system. During execution the cerebellum plays an important role in 
comparing the reference from the encephalon and the sensorial data from the motor 
system. 

Different postures are available for human grasping. They differ in the number of 
degrees of freedom and in the force exerted. The chosen posture depends on the prop- 
erties of the object and on the task. Many authors have proposed interpretations about 
the generation of forces and positions. In [1, 2, 3] three basic directions (or primitives)
are defined, as illustrated in Figure 1.

1. Pad opposition moves the fingers in a direction parallel to the palm. It identifies 
the direction x. 

2. Palm opposition moves along a direction perpendicular to the palm, and identifies 
the z axis. 

3. Side opposition is along a direction along the palm and identifies the y axis. 




Fig. 1 . A. Pad opposition; B. Palm opposition; C. Side opposition 

In Pad Opposition, the hand can exert small movements and apply small forces.
It is typical of precision movements. In Palm Opposition the hand can use large
forces. In Side Opposition the result is intermediate: medium precision and medium
forces.

Cutkosky et al. [4, 5] proposed a classification that integrates the relevance of the
task considered with the precision and the power of grasping. Some classes have been
modelled with the aim of automatically producing actuation commands for a given hand
grasping a given object.

Considering the difficulties of mathematical models for the hand and the object, we 
see here how to learn from data acquired from a human teacher. 

In Section 2 we illustrate the use of Neural Networks to learn grasping from differ- 
ent trials. 

In Section 3 we describe our data glove used as a tool to get data, and in Section 4 
we use data to infer grasping positions and force on new objects. We adopted in prac- 
tice the pad opposition and the side opposition schemas to grasp simple objects and to 
analyse position and force data applied by a human to get a model for an artificial 
hand with similar kinematics. 

In Section 5 we show the kinematics of the humanoid hand that will execute the
motions. Afterwards we discuss the results and conclude.



2 Neural Networks for Grasping 

The learning capability of neural networks has been applied to many fields in robotics
as well as to many data analysis problems. Several NN architectures for grasping have
been reported in the literature.

Some authors proposed to learn the hand position for grasping. Kuperstein and
Rubinstein [6] used a gripper with 5 degrees of freedom. Taha et al. [7] studied how
to move a hand with 7 d.o.f. to grasp objects of different shapes. They applied those
ideas to the control of prostheses. They defined two postures of the hand, at time 0,
when it is open, and at time 1, when the object is grasped, for a given type and size
of object. The NN was able to compute the intermediate positions to obtain the grasp.

Moussa and Kamel studied how to learn generic grasping positions as a mapping
between the reference frame of the object and the hand, using the contact information
from the tips of 5 fingers. One module configures the hand position, another
module defines the finger positions. Finger positions are determined by a
network that solves the inverse kinematics. H. Huang et al. developed
CANFM (Cascade Architecture of Neural Network with Feature Map) to classify 8
kinds of grasping using as input the EMG signals collected from 3 places on the
patient’s arm. The final task is to classify them according to the 8 kinds of grasps. The
first layer is a Kohonen network, which receives 3 inputs from the sensors. The output
from the SOM selects the x-y coordinates in a 2D topological net, used to provide 6
inputs to a BPNN.

The last research to mention here is from Matsuoka [8], developed for the Cog robot
with a hand of four fingers. Every finger has 4 dof, actuated by a motor through a
cable (tendon), each with a position sensor. Force sensors are distributed over the hand.
The grasp activity of Cog imitates the human grasping reflex of babies. The hand
controller has two parts: the first is a reflex control, the second is a NN. Data are
collected from experiments on the real robot: time of closure and forces, for different
objects, are the input to the network (see Figure 2).




Fig. 2. The Cog controller 



We chose neural networks for their easy implementation and integration in the 
Matlab/Simulink environment, and for their real-time behaviour after training. 



3 Data Acquisition with Our Data Glove 

The glove we designed is equipped with 16 sensors: 14 position sensors and 2 force
sensors. Position sensors measure the flexion of each phalanx for every finger, except
the little finger. There are also two special sensors that measure the adduction of the
thumb and the index. Indeed these two fingers have a first phalanx with greater
mobility, especially for the adduction and abduction movement.

Sensor signals are acquired using an electronic board that is connected to an
analog/digital card (PCI 812) mounted on a PC, and using the XpcTarget tool of
Matlab. Data are sampled and converted into digital format at a frequency of 2 kHz.
The glove is shown in Figure 3.




Fig. 3. The data glove 



Our data glove acquires 16 signals: 

• 12 positions for the phalanxes of first 4 fingers 

• 2 abduction for thumb and index 

• 2 forces for thumb and middle 

The samples of grasps have been designed to learn different sizes and different ma- 
terials. 

The grasp objects are made of:

• Polystyrene (density = 0.029297 g/cm³)

• Wood (density = 0.70898 g/cm³)

• PVC (density = 1.4219 g/cm³)

Objects of 7 different dimensions for each material are provided, as illustrated in 
Figure 4. 

For each object size and material, ten grasps are monitored, and the positions of 
phalanxes, the abductions, the forces, are stored in an array of 16 columns. The rows 
are related to different times during grasping; with a sampling time of 0.0005 sec, 500 
rows for each grasp are usual. 

Three steps are needed before learning: 

1. validation 

2. deriving the force on the index 

3. normalization 
Fig. 4. The cubes used in the experiments



The grasp is different for different dimensions of the object. For the small cubes, of
2 or 3 cm size, the grasp involves only the thumb and the index, which apply two forces
of equal magnitude in opposite directions.




Fig. 5. Grasping a 3cm cube, with the directions of forces 

For bigger cubes the middle finger is also used, on the same surface as the index.
The force of the thumb equals the sum of the forces of the index and middle fingers.

The force of the index is computed as

1. for small 2 or 3 cm cubes: $F_{index} = F_{thumb}$, $F_{middle} \approx 0$

2. for cubes of 4 to 8 cm: $F_{index} = F_{thumb} - F_{middle}$

The network has two inputs, the cube dimension and the density of the material,
both normalized in [0,1].

Densities are normalized using the maximum density, as reported above, while
the dimensions are normalized to the maximum size of 10 cm.





144 Michele Folgheraiter, Ilario Baragiola, and Giuseppina Gini 




Fig. 6. Grasping a 6 cm cube 

The outputs are positions and forces for grasping, normalized in [-1,1] according to
the formula:

$$X_n = 2\,\frac{X - X_{min}}{X_{max} - X_{min}} - 1$$

with $X_{min}$ the minimum value in the distribution and $X_{max}$ the maximum value in
the distribution.

Examples of input and output are given in Table 1. In all the tables, numbers use the
comma as decimal separator, as in our Matlab settings.
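
The preprocessing steps of this section can be sketched in Python as follows. This is a hedged illustration rather than the authors’ Matlab code: the function names are hypothetical, the output scaling assumes the usual min-max mapping onto [-1, 1], and the maximum density is passed in explicitly because the exact normalizing constant is an assumption here.

```python
def index_force(cube_cm, f_thumb, f_middle):
    """Force on the index finger, derived from the two measured forces."""
    if cube_cm <= 3:
        # small cubes: only thumb and index take part, with equal forces
        return f_thumb
    # larger cubes: index and middle together balance the thumb
    return f_thumb - f_middle


def normalize_inputs(size_cm, density, max_size_cm, max_density):
    """Scale the two network inputs into [0, 1]."""
    return size_cm / max_size_cm, density / max_density


def scale_to_unit_range(x, x_min, x_max):
    """Assumed min-max scaling of an output value into [-1, 1]."""
    return 2.0 * (x - x_min) / (x_max - x_min) - 1.0


# Usage: a 2 cm PVC cube, normalized by the largest size and density in the text.
size_n, density_n = normalize_inputs(2.0, 1.4219, max_size_cm=10.0, max_density=1.4219)
f_index_small = index_force(3, f_thumb=1.2, f_middle=0.0)
f_index_large = index_force(6, f_thumb=2.0, f_middle=0.5)
```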



4 Training the Network 

The chosen architecture is a feedforward network with 2 inputs, 17 outputs and 2
hidden layers of 20 neurons each, with a tangent sigmoid transfer function. It is
trained with the Matlab algorithm “traingdx”, a fast backpropagation variant with
heuristic techniques. The performance of the algorithm depends on the learning rate,
which is adaptively adjusted.
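
The shape of this network can be illustrated with a small forward pass in plain Python. This is only a sketch of the architecture, not the Matlab model: the weights are random rather than trained, the tangent sigmoid is taken to be the hyperbolic tangent and is assumed to be used on the output layer as well, and the names `init_layer` and `forward` are hypothetical.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(layers, x):
    """Propagate an input vector through every layer with a tanh activation."""
    for weights, biases in layers:
        x = [
            math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)
        ]
    return x


rng = random.Random(0)
sizes = [2, 20, 20, 17]  # inputs, two hidden layers, outputs
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
n_parameters = sum(len(w) * len(w[0]) + len(b) for w, b in layers)

# Two normalized inputs: material density and cube size.
output = forward(layers, [0.94793, 0.2])
```

Because the hyperbolic tangent is bounded, every output lies in [-1, 1], which matches the range used for the normalized targets.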

After several training runs, the best network in terms of number of neurons, training
time and MSE is chosen. The MSE is illustrated in Figure 7. As we see, after 300 epochs
the error has stabilized.

We compare the real data obtained from the average on all the materials for the 
same cube size and compute the error and the variance. We see in Table 2 the results 
for position errors, while in Table 3 the results for force errors. 

Table 1. Input and output samples

          Density    Size
Input1    0,94793    0,2
Input2    0,94793    0,2
Input3    0,94793    0,2
Input4    0,94793    0,2
Input5    0,94793    0,2

       I1        I2         I3         M1          M2         M3          A1        A2
T1     0,54135   -0,42844   -0,10097   -0,094291   -0,50748   -0,065206   0,45685   -0,58024
T2     0,54135   -0,42844   -0,10097   -0,094291   -0,50748   -0,065206   0,45685   -0,58024
T3     0,54135   -0,42844   -0,10097   -0,094291   -0,50748   -0,065206   0,45685   -0,58024
T4     0,54135   -0,42844   -0,10097   -0,094291   -0,50748   -0,065206   0,45685   -0,58024
T5     0,54135   -0,42844   -0,10097   -0,094291   -0,50748   -0,065206   0,45685   -0,58024
T6     0,54135   -0,42844   -0,10097   -0,094291   -0,50748   -0,065206   0,45685   -0,58024
T7     0,54135   -0,42844   -0,10097   -0,094291   -0,50748   -0,065206   0,45685   -0,58024
T8     0,54135   -0,42844   -0,10097   -0,094291   -0,50748   -0,065206   0,45685   -0,58024
T9     0,54135   -0,42844   -0,10097   -0,094291   -0,50748   -0,065206   0,45685   -0,58024
T10    0,54135   -0,42844   -0,10097   -0,094291   -0,50748   -0,065206   0,45685   -0,58024
T11    0,53884   -0,42494   -0,0953    -0,098169   -0,50475   -0,065206   0,44193   -0,5841
T12    0,53884   -0,42494   -0,0953    -0,098169   -0,50475   -0,065206   0,44193   -0,5841
T13    0,53884   -0,42494   -0,0953    -0,098169   -0,50475   -0,065206   0,44193   -0,5841
T14    0,53884   -0,42494   -0,0953    -0,098169   -0,50475   -0,065206   0,44193   -0,5841
T15    0,53884   -0,42494   -0,0953    -0,098169   -0,50475   -0,065206   0,44193   -0,5841
T16    0,53884   -0,42494   -0,0953    -0,098169   -0,50475   -0,065206   0,44193   -0,5841
T17    0,53884   -0,42494   -0,0953    -0,098169   -0,50475   -0,065206   0,44193   -0,5841
T18    0,53884   -0,42494   -0,0953    -0,098169   -0,50475   -0,065206   0,44193   -0,5841

Fig. 7. MSE after training

We applied early stopping to avoid overfitting of the network. This method
requires that data are divided in 3 sets: the training set (used to compute the gradient
and the weights), the validation set (used to monitor the error), and the test set. In
Figure 8 we see in blue the MSE on the training set, in green on the validation set.
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Table 2. Results of position error analysis 



PVC 




Cube 2 cm 


Cube 4 cm 


Cube 6 cm 


Cube 8 cm 


Mean error 
Variance 




0,42869 

2,4075 


-0,04124 

3,135 


0,29123 

0,75822 


0,013478 

0,32512 




WOOD 




Cube 2 cm 


Cube 4 cm 


Cube 6 cm 


Cube 8 cm 


Mean error 
Variance 




0,25967 

2,0385 


0,78452 

4,8306 


0,84806 

4,7805 


0,42862 

1,3845 




POLYSTYRENE 


Cube 2 cm 


Cube 4 cm 


Cube 6 cm 


Cube 8 cm 


Mean error 
Variance 


0,033329 

0,48287 


0,45582 

3,5469 


0,2648 

1,8219 


0,3304 

0,92951 



Table 3. Force error analysis

PVC            Cube 2 cm    Cube 4 cm    Cube 6 cm    Cube 8 cm
Mean error     -0.005613    -0.000561    -0.002300    -0.015903
Variance       0.000093     0.000017     0.000031     0.000704

WOOD           Cube 2 cm    Cube 4 cm    Cube 6 cm    Cube 8 cm
Mean error     -0.014947    -0.003487    -0.000958    0.008940
Variance       0.000340     0.000765     0.000196     0.000128

POLYSTYRENE    Cube 2 cm    Cube 4 cm    Cube 6 cm    Cube 8 cm
Mean error     -0.000378    -0.004563    -0.000323    -0.000960
Variance       0.000051     0.000014     0.000004     0.000013




Fig. 8. MSE training set, validation set 



The training stops after 117 epochs, with an MSE of 0.00654269, when the error on the
validation set starts growing.
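
The stopping rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `early_stopping_epoch` and the `patience` parameter are assumptions, and the validation errors in the usage example are invented.

```python
def early_stopping_epoch(validation_errors, patience=1):
    """Return (best_epoch, best_error) under a simple early-stopping rule.

    Training is considered stopped once the validation error has failed to
    improve for `patience` consecutive epochs. Epochs are numbered from 1.
    `patience` is a hypothetical parameter; the text only says that training
    stops when the validation error starts growing.
    """
    best_epoch, best_error = 0, float("inf")
    epochs_without_improvement = 0
    for epoch, error in enumerate(validation_errors, start=1):
        if error < best_error:
            best_epoch, best_error = epoch, error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation error started growing: stop training
    return best_epoch, best_error


# Invented validation errors: the error rises at epoch 4, so training stops there
# and the weights from epoch 3 would be kept.
print(early_stopping_epoch([0.5, 0.3, 0.2, 0.25, 0.1]))
```

With `patience=1` the rule stops at the first increase of the validation error; a larger value tolerates short fluctuations before stopping.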

After training, we separately analyse the errors on the training set and on the test
set. Results on the training set are reported in Table 4.

The test set instead contains cubes made of the three materials with sizes of 3, 5,
and 7 cm. Repeating the analysis on the test set, we observe that the error (in degrees)
for positions has a very low average (less than 2 degrees) but a high variance. The
error for force is around 0.01 kilograms, and the variance is low.

Comparing the results after controlling the overfitting, we observe a clear improvement
on the test set in comparison with the results obtained before checking overfitting.



Table 4. Errors on the prediction of the network on the training set

PVC            Cube 2 cm    Cube 4 cm    Cube 6 cm    Cube 8 cm
Mean error     0.49529      -0.26398     0.58482      -0.92526
Variance       4.8442       8.6617       9.7215       7.8894

WOOD           Cube 2 cm    Cube 4 cm    Cube 6 cm    Cube 8 cm
Mean error     0.54215      0.41388      -1.0726      -0.32047
Variance       11.556       8.6249       14.358       1.8077

POLYSTYRENE    Cube 2 cm    Cube 4 cm    Cube 6 cm    Cube 8 cm
Mean error     -0.34199     0.57926      0.15018      -0.49648
Variance       4.6829       5.0929       4.2389       9.318



Without overfitting checking, the position error ranges from ±0.5° to ±2.20° on the
training set, but grows as far as ±8.05° on the test set. Moreover, the error is
randomly distributed.

Tables 5 and 6 report the data from Net 1, the basic net, and Net 2, trained with
early stopping.

Table 5. Results from the net without and with early stopping on the training set

Standard deviation    2 cm       4 cm       6 cm       8 cm
PVC Net 1             1.5516     1.7706     0.87076    0.5702
PVC Net 2             1.1134     1.4791     2.2453     1.3917
Wood Net 1            1.4278     2.1979     2.1864     1.1766
Wood Net 2            2.1143     2.4236     1.4239     1.0625
Polystyrene Net 1     0.6949     1.8833     1.3498     0.96411
Polystyrene Net 2     1.7062     2.2723     1.7304     2.3702



Table 6. Results from the net without and with early stopping on the test set

Standard deviation    3 cm       5 cm       7 cm
PVC Net 1             5.0229     3.1659     3.2374
PVC Net 2             7.0026     5.4505     5.4312
Wood Net 1            6.4726     6.0083     4.4586
Wood Net 2            7.848      5.6188     7.0634
Polystyrene Net 1     8.0516     2.3718     4.8196
Polystyrene Net 2     7.4079     1.8793     4.5296



For the second network, the error on the training set is in the range ±1.06° to
±2.42°, and on the test set it is in the range ±1.9° to ±7.8°.

With the obtained values we can actuate a real anthropomorphic hand which has
the kinematics described in Section 5. The force value will be used by our control
system, which is able to apply given forces on the fingertips.

5 The Kinematics of the Humanoid Hand

We describe the humanoid hand we are using with our humanoid arm. More details
are in [9, 10]. All the fingers but the thumb are identical, as illustrated in Figure 9.
The hand mimics the human fingers in size and structure: it has a spherical joint with
2 degrees of freedom from the metacarpus to the first phalanx, and cylindrical joints
between the phalanges. The actuation is obtained through tendons driven by McKibben
pneumatic actuators.





Fig. 9. Side view of the right index, with the joint reference systems 



To compute the direct kinematics we use the Denavit-Hartenberg notation to build
the transformation matrices from the reference systems defined in the joints.

We assume that the coordinate system $O_0, X_0, Y_0, Z_0$ moves together with
$R, X, Y, Z$. In this case the first matrix has $\alpha_0 = \theta_0 = 0$, and
simplifies to:

\[
T_{0 \to R} =
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}
\tag{3}
\]

The second matrix is a simple translation, as in equation 4:

\[
T_{1 \to 0} =
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & L_1 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}
\tag{4}
\]

The matrix from $O_2, X_2, Y_2, Z_2$ to $O_1, X_1, Y_1, Z_1$ contains the variables
$\alpha_1$ and $\theta_1$:

\[
T_{2 \to 1} =
\begin{pmatrix}
\cos\theta_1 & -\cos\alpha_1 \sin\theta_1 & \sin\alpha_1 \sin\theta_1 & -L_2 \cos\alpha_1 \sin\theta_1 \\
\sin\theta_1 & \cos\alpha_1 \cos\theta_1 & -\sin\alpha_1 \cos\theta_1 & L_2 \cos\alpha_1 \cos\theta_1 \\
0 & \sin\alpha_1 & \cos\alpha_1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}
\tag{5}
\]

With post-multiplication we obtain the matrix that transforms a point given in the
second phalanx system $O_2, X_2, Y_2, Z_2$ to the base system $R, X, Y, Z$:

\[
\begin{pmatrix} X_R \\ Y_R \\ Z_R \\ 1 \end{pmatrix}
= T_{0 \to R} \times T_{1 \to 0} \times T_{2 \to 1} \times
\begin{pmatrix} X_2 \\ Y_2 \\ Z_2 \\ 1 \end{pmatrix}
\tag{6}
\]
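
The chain of transformations can be illustrated with a small sketch in plain Python. It is a generic illustration of composing homogeneous transforms by post-multiplication, not a transcription of the matrices above: the third transform is assumed here to be a rotation about z by θ1, followed by a rotation about x by α1 and a translation of L2 along the rotated y axis, and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two 4x4 homogeneous transformation matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(x, y, z):
    """Pure translation; translation(0, 0, 0) is the identity."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


def rot_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, 0.0], [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]


def rot_x(alpha):
    c, s = math.cos(alpha), math.sin(alpha)
    return [[1.0, 0.0, 0.0, 0.0], [0.0, c, -s, 0.0],
            [0.0, s, c, 0.0], [0.0, 0.0, 0.0, 1.0]]


def transform_point(t, point):
    """Apply a homogeneous transform to a 3D point."""
    column = (point[0], point[1], point[2], 1.0)
    return tuple(sum(t[i][k] * column[k] for k in range(4)) for i in range(3))


def phalanx_point_in_base(point_2, l1, l2, alpha1, theta1):
    """Map a point given in frame 2 to the base frame R."""
    t_0_to_r = translation(0.0, 0.0, 0.0)  # identity: frame 0 moves with R
    t_1_to_0 = translation(0.0, l1, 0.0)   # translation by L1 along y
    # Assumed form of the joint transform (see the lead-in above).
    t_2_to_1 = matmul(matmul(rot_z(theta1), rot_x(alpha1)),
                      translation(0.0, l2, 0.0))
    chain = matmul(matmul(t_0_to_r, t_1_to_0), t_2_to_1)
    return transform_point(chain, point_2)


# Hypothetical link lengths in cm, joint fully extended.
print(phalanx_point_in_base((0.0, 0.0, 0.0), 2.0, 3.0, 0.0, 0.0))
```

With both angles at zero the origin of frame 2 lies at distance L1 + L2 along the y axis of the base frame, which is a quick sanity check on the order of the products.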



We need to express a point P on the fingertip, that is, on the last phalanx. As in
the human hand, however, the actuation of the last phalanx is not independent of the
actuation of the first phalanx. The analysis illustrated in Figure 10 gives the answer.

Fig. 10. Resolving the last phalanx

In fact, we obtain the values to be used in the previous equation by considering the
contribution of the third joint:

\[
\begin{aligned}
P(x_2) &= 0 \\
P(y_2) &= L_2 \cos\alpha_2 + L_3 \cos(\alpha_2 + \alpha_3) \\
P(z_2) &= L_2 \sin\alpha_2 + L_3 \sin(\alpha_2 + \alpha_3)
\end{aligned}
\tag{7}
\]
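
Equation 7 can be evaluated directly. The sketch below assumes angles in radians and link lengths in arbitrary but consistent units; the function name `fingertip_in_frame_2` and the numeric values are hypothetical.

```python
import math


def fingertip_in_frame_2(l2, l3, alpha2, alpha3):
    """Fingertip coordinates in frame 2 according to equation 7.

    l2, l3: lengths of the second and third phalanx.
    alpha2, alpha3: flexion angles of the second and third joint, in radians.
    """
    x = 0.0
    y = l2 * math.cos(alpha2) + l3 * math.cos(alpha2 + alpha3)
    z = l2 * math.sin(alpha2) + l3 * math.sin(alpha2 + alpha3)
    return x, y, z


# Hypothetical phalanx lengths in cm; the finger is fully extended.
print(fingertip_in_frame_2(4.0, 3.0, 0.0, 0.0))
```

For a fully extended finger the fingertip lies on the y axis at distance L2 + L3, and flexing only the second joint by 90 degrees moves it onto the z axis at the same distance.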

Another important aspect is the computation of the inverse kinematics of the hand,
which is needed to find the actuation values that reach a given position. In our case,
we obtain the joint angles directly from the data glove, and we only need to transform
the angles into actuator values, i.e. the lengths of the McKibben muscles required to
obtain the given joint angles. This transformation is easily computed from the actuator
model.



6 Discussion and Conclusion 

Future work will require applying the predicted values to the real robot hand and
checking the resulting action. There is still a long way to go to improve the
manipulation ability of our robot. Other kinds of grasping will be studied, for
instance the precision and pinch grasps. The idea is to generate different networks
for the different grasping configurations of the hand, and to develop an arbitration
network that selects the appropriate net according to the shape of the object.

Considering the data acquisition phase, data obtained with the same glove from
different executors could be compared to understand the variability among different
people performing the same task and to find more standardized ranges of values. It is
our opinion that some of the variance of the learned data is simply acceptable, and
that different position/force patterns can reach a stable grasp on the object.

With respect to mathematical modelling, our approach has the advantage of a very low
computational complexity and of being usable without a complete geometric
description of the object to be grasped.



References 

1. Iberall T.: The nature of human prehension: three dextrous hands in one, IEEE Trans. RA 5,
3 (1999)

2. Iberall T.: Human Prehension and dexterous robot hand, International Journal of Robotics
Research (1989)

3. Iberall T., Bingham G., Arbib M.A.: Opposition space is a structuring concept for the
analysis of skilled hand movements. In Heuer H., Fromm C. (eds): Generation and
modulation of action patterns, Springer-Verlag, Berlin (1986) 158-173

4. Cutkosky M.R., Wright P.K.: Friction, stability and the design of robotic fingers,
International Journal of Robotics Research 5, 4 (1987) 20-37

5. Cutkosky M.R.: On grasp choice, grasp models and the design of hands for manufacturing
tasks, IEEE Trans. RA 5, 3 (1989) 269-279

6. Kuperstein M., Rubinstein J.: Implementation of the adaptive controller for sensory-motor
condition. In Pfeifer R., Schreter Z., Fogelman F., Steels L. (eds): Connectionism in
perspective, Elsevier Science Publishers (1989) 49-61

7. Taha Z. et al.: Modelling and simulation of the hand grasping using neural networks,
Med. Eng. Phys. 19, 6 (1997) 536-538

8. Matsuoka Y.: The Mechanisms in a Humanoid Robot Hand, Autonomous Robots 4, 2
(2000) 199-209

9. Folgheraiter M., Gini G.: Blackfingers: an artificial hand that copies human hand in
structure, size, and function, Proc. IEEE Humanoids (2000) Cambridge, Mass.

10. Folgheraiter M., Gini G.: Human-like hierarchical reflex control for an artificial hand,
Proc. IEEE Humanoids (2001) Tokyo, Japan



JavaSpaces - An Affordable Technology 
for the Simple Implementation 
of Reusable Parallel Evolutionary Algorithms 



Christian Setzkorn and Ray C. Paton 

Department of Computer Science, University of Liverpool, 
Peach Street, L69 7ZF, UK 
{C.Setzkorn, R.C.Paton}@csc.liv.ac.uk



Abstract. Evolutionary algorithms are powerful optimisation methods and have 
been applied successfully in many scientific areas including life sciences. How- 
ever, they have high computational demands. In order to alleviate this, parallel 
evolutionary algorithms have been developed. Unfortunately, the implementa- 
tion of parallel evolutionary algorithms can be complicated, and often requires 
specific hardware and software environments. This frequently results in very 
problem-specific parallel evolutionary algorithms with little scope for reuse. This 
paper investigates the use of the JavaSpaces technology to overcome these prob- 
lems. This technology is free of charge, simplifies the implementation of paral- 
lel/distributed applications, and is independent of hardware and software envi- 
ronments. Several approaches for the implementation of different parallel evolu- 
tionary algorithms using JavaSpaces are proposed and successfully tested. 

Keywords: Parallel evolutionary algorithms, data mining 



1 Introduction 

Evolutionary algorithms (EAs) are well known optimisation methods. They have been 
applied successfully in many scientific areas including life sciences (e.g. [21]). How- 
ever, EAs require large amounts of computational resources, especially when the prob- 
lems to be tackled become complicated and/or when the evaluation of the candidate 
solutions (individuals) is computationally expensive [3]. This can hamper their practi- 
cality. For these reasons, many researchers have proposed paradigms to execute EAs 
in parallel. This has given rise to Parallel Evolutionary Algorithms (PEAs) (see for 
example [3, 22] for reviews). Unfortunately, the implementation of PEAs can be com- 
plicated, and often requires specific software and hardware. This frequently results in 
very problem-specific applications with little scope for reuse. 

This paper investigates the utility of the JavaSpaces technology¹ for the
implementation of several types of PEAs. The JavaSpaces technology has many advantages:
it is free of charge, simplifies the implementation of parallel/distributed applications,
and is independent of the underlying hardware and software environment. It can therefore
be used to harvest the computational power of a number of different computers within
institutions with different hardware and software specifications (operating systems). To
our knowledge JavaSpaces itself has never been used to implement PEAs. Atienza et
al. [2] have used the Jini technology to implement PEAs. JavaSpaces builds upon Jini, but
is easier to use as it works on a higher abstraction layer.

¹ JavaSpaces is a trademark of SUN Microsystems, Inc.

We propose and test three frameworks for the implementation of different PEAs:
a synchronous master-slave PEA, an asynchronous master-slave PEA, and a
coarse-grained PEA. The scalability of the implemented approaches is evaluated on a
particular machine-learning problem: the induction of classifiers from data. It is
important to note that the proposed frameworks can also be used for other problems. The
use of EAs for the induction of classifiers has a long tradition. Research in this area
started in the early 1980s (e.g. [12, 13, 16]). An explanation for the popularity of EAs
for this task is that they can deal with attribute interactions, cope with noise and
perform a global search, in contrast to deterministic approaches [5, 8, 9].
Unfortunately, evolutionary approaches are often too slow to be used for large
real-world data sets. This is due to the fact that each individual usually has to be
evaluated on all data samples to determine its fitness [1].

The implemented approach induces a particular type of classifier from data: fuzzy
classification rule systems. This type of classifier was chosen due to its potentially
high comprehensibility [19]. A general introduction to the problem of fuzzy
classification rule induction can be found in [17]. Cordon et al. [4] provide an
introduction to the construction of fuzzy classification rule systems utilising EAs. The
implemented approach deploys a multi-objective evolutionary algorithm (MOEA), building
upon the work of Zitzler et al. [25] and Ishibuchi et al. [15]. A MOEA has been used as
there exists a trade-off between the fit of a classifier to the data and its complexity
(e.g. number of rules). In an earlier study, we have shown that the implemented approach
produces classifiers that contain fewer rules than other existing approaches. The reader
is referred to [24], which provides the results and describes the approach in more detail.

This paper is structured as follows. Section 2 provides some details about the
JavaSpaces technology. The implemented PEAs are explained in Section 3 and their
scalability is investigated in Section 4. Section 5 contains the conclusions, and
suggests avenues for future research.



2 JavaSpaces - A Brief Introduction 

SUN Microsystems Inc. proposed the JavaSpaces specification in 1999. It was inspired
by the concept of Linda [10]. Linda’s core idea is that a storage space for objects can
greatly simplify the implementation of parallel and distributed applications. This is
because it can be accessed by several processes using a small number of operations.
The storage space is referred to as ‘tuple space’ [10], which inspired the name
JavaSpaces. SUN Microsystems Inc. offers an implementation of the JavaSpaces
specification, which utilises technologies such as Jini, RMI [18, 23] and the programming
language JAVA [14]. Therefore a JavaSpace can be executed on many different computer
platforms due to JAVA’s platform independence.

Processes that run within a network of computers can access the JavaSpace in a
concurrent manner. This allows them to communicate and coordinate their actions by
writing and reading objects to and from the JavaSpace. Objects are entities that are
defined in terms of data and behaviours. They can represent numbers, strings or arrays,
and/or executable programs (the individuals within an evolutionary algorithm). The
JavaSpace itself handles the concurrent access, and hence frees the programmer from
problems such as race conditions. Objects can be exchanged using the operations write,
read, and take. Transactions [7] can be used to make the operations secure because there
are many potential sources of failure within a network. For example, objects can get lost
on the way between a writing/reading process and a JavaSpace.
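
The three operations can be illustrated with a minimal in-memory sketch. It is not the JavaSpaces API: the class `TupleSpace`, the use of dictionaries as entries, and the rule that a template matches when all of its fields are equal are assumptions made for illustration, and the sketch covers neither transactions nor access over a network.

```python
import threading


class TupleSpace:
    """In-memory stand-in for a shared object space (illustration only)."""

    def __init__(self):
        self._entries = []
        self._condition = threading.Condition()

    def write(self, entry):
        """Store a copy of the entry and wake up waiting readers/takers."""
        with self._condition:
            self._entries.append(dict(entry))
            self._condition.notify_all()

    def _find(self, template):
        # A template matches an entry if every template field is equal.
        for entry in self._entries:
            if all(entry.get(key) == value for key, value in template.items()):
                return entry
        return None

    def read(self, template, timeout=None):
        """Return a copy of a matching entry, leaving it in the space."""
        with self._condition:
            found = self._condition.wait_for(
                lambda: self._find(template) is not None, timeout)
            return dict(self._find(template)) if found else None

    def take(self, template, timeout=None):
        """Remove a matching entry from the space and return it."""
        with self._condition:
            found = self._condition.wait_for(
                lambda: self._find(template) is not None, timeout)
            if not found:
                return None
            entry = self._find(template)
            self._entries.remove(entry)
            return entry


space = TupleSpace()
space.write({"kind": "task", "id": 1})
print(space.read({"kind": "task"}))   # entry stays in the space
print(space.take({"kind": "task"}))   # entry is removed
print(space.take({"kind": "task"}, timeout=0))  # nothing left
```

Because every operation holds the same lock, two workers can never take the same entry, which is the property the master-slave implementations in Section 3 rely on.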

Processes can register their interest in a particular type of object with the space.
When an object of that type is written into the JavaSpace, the space notifies the
registered processes via a RemoteEvent [20]. Therefore, many different applications
(e.g. evolutionary algorithms) could use the same JavaSpace at the same time.

In summary, the JavaSpaces technology provides processes in a computer network with
a shared, persistent, and securely accessible medium for exchanging objects. This
allows the simple implementation of parallel and distributed applications. For an
in-depth introduction to the JavaSpaces technology see for example [6, 7].



3 The Implemented Approaches 

3.1 The Synchronous Master-Slave Parallel Evolutionary Algorithm 

Figure 1 depicts the structure of the JavaSpaces-based synchronous master-slave PEA
implementation. The synchronous master-slave PEA was first proposed by Grefenstette
in 1981 [11]. The JavaSpaces implementation was inspired by the compute-server
approach proposed by Freeman et al. [7].

Fig. 1. Structure of the JavaSpaces-based synchronous master-slave PEA implementation.

The master executes the evolutionary algorithm. During the fitness evaluation copies
of individuals are written into the JavaSpace. When the master creates a copy of an
individual, it attaches a unique identifier and a web server address (URL) to it. Workers
(slave computers) take individuals from the space (one at a time) and evaluate them. If
data is required during the evaluation, it is downloaded from the web server indicated
by the URL attached to each individual. This is only done if the worker has not already
downloaded the necessary data. A worker does not return the evaluated individual into
the space but rather a ‘result object’ consisting of the individual’s objective values and
its unique identifier. The master removes result objects from the space, and uses the
unique identifier to update the objective values of a particular individual. The master
carries on with other evolutionary processes as soon as it has updated the objective
values of all individuals. This approach has many advantages, which are listed and
explained below.

• It is easy to implement and does not change the dynamics of the underlying
  evolutionary process.

• It implements automatic load balancing. Workers that represent more powerful
  computers, or those that receive simpler individuals, can evaluate more of them
  as they can access the space concurrently.

• Workers can be added and removed whilst the evolutionary process is running.

• Workers can be executed on many different hardware and software platforms
  because they are implemented in JAVA, which is a platform-independent language.

• Workers do not steal any computational resources as long as no individuals are
  written into the space. They could therefore run/sleep on all the computers of an
  institution at all times. A policy could be established that individuals are only
  written into the space outside normal working hours. In this manner all the computer
  power of an institution could be harvested at virtually no cost, since the computers
  are often not switched off and JavaSpaces is free of charge.

• The programming language JAVA also offers what is known as dynamic class
  downloading [23], which enables workers to evaluate individuals originating from
  different evolutionary algorithms that are tackling different problems. As long as
  the object (a particular individual) implements a specific interface (e.g.
  SpaceIndividualInterface), a worker is capable of downloading the necessary class
  files from a web server in order to evaluate (execute) this individual. This means
  that such a parallel evolutionary algorithm can be reused (without shut-down) or
  could even be used by several users, running different evolutionary algorithms at
  the same time.

However, the presented approach has one disadvantage. It does not fully exploit the 
power of the utilised computers. This is due to the idle times of the master and the 
workers. The master has to wait until all the workers have returned the evaluated indi- 
viduals and the workers are idle until the master initiates the fitness evaluation again. 
Asynchronous master-slave and coarse-grained PEAs can be used to overcome these 
problems. The implementation of these PEAs using the JavaSpaces technology is ex- 
plained in the next sections. 
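
The synchronous master-slave scheme of this section can be sketched on a single machine with threads standing in for worker computers and two queues standing in for the space. This is an illustration under those assumptions, not the JavaSpaces implementation: the function names are hypothetical, and the `None` sentinel used to stop the workers is an addition of the sketch (in the text, workers simply wait for new individuals).

```python
import queue
import threading


def worker(tasks, results, evaluate):
    """Take (identifier, individual) pairs and write back result objects."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel added by this sketch to stop the worker
            break
        identifier, individual = item
        # Only the objective value and the identifier are returned,
        # not the individual itself.
        results.put((identifier, evaluate(individual)))


def evaluate_population(population, evaluate, n_workers=4):
    """Master side: distribute individuals and wait for all results."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results, evaluate))
               for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for identifier, individual in enumerate(population):
        tasks.put((identifier, individual))
    fitness = [None] * len(population)
    for _ in population:
        # Synchronous step: the master blocks until every result has arrived.
        identifier, value = results.get()
        fitness[identifier] = value
    for _ in threads:
        tasks.put(None)
    for thread in threads:
        thread.join()
    return fitness


# Hypothetical fitness function: the square of an integer "individual".
print(evaluate_population([1, 2, 3, 4, 5], lambda x: x * x, n_workers=3))
```

Faster workers return to the task queue sooner and therefore take more individuals, which corresponds to the automatic load balancing described above. The blocking loop in the master is also where the idle time of the synchronous variant arises.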

3.2 The Asynchronous Master-Slave Parallel Evolutionary Algorithm 

The asynchronous master-slave PEA has the same structure as the synchronous
master-slave PEA and was also put forward by Grefenstette in 1981 [11]. In order to
tackle the aforementioned ‘idle time problem’ of the synchronous master-slave PEA, a
non-generational EA is executed on the master. Individuals are written into the space as
soon as they require evaluation (e.g. after a mutation). Therefore workers are constantly
supplied with individuals and the master can proceed with other evolutionary processes
as soon as a sufficient number of individuals have been taken from the space². This
results in lower idle times for both workers and the master.

3.3 The Coarse-Grained Parallel Evolutionary Algorithm 

The coarse-grained PEA has the same structure as the synchronous master-slave
approach. However, each worker runs an evolutionary algorithm instead of just evaluating
individuals. The workers have to be started first. When a worker has started, it writes an
object containing its IP address into the space. After all workers have started, the
master is started. It removes all of these objects from the space. The master then
writes an object containing all the accumulated IP addresses back into the space. Each
worker reads this object from the space. This makes the workers aware of each other.

The master now initialises a population of individuals and writes them into the 
space. Each of these individuals contains an IP-address and, if necessary, the web server 
address of the data. An IP-address is chosen with a uniform probability from the previ- 
ously accumulated IP-addresses. A worker reads only those individuals from the space 
that contain its IP-address. 

Each individual also contains a counter. Each time an individual passes through the
evolutionary algorithm of a worker, this counter is increased. If a counter reaches a
maximum value (number of generations), an indicator is added to the individual and it
is written into the space. This indicator makes it impossible for a worker to take such
an individual from the space again. The master can only take an individual from the space
if it contains this indicator. This enables the master to accumulate the final population.

If an individual’s counter has not yet reached its maximum value, a process within
the worker decides whether a particular individual is sent (migrated) to another
worker or whether it remains with the worker. An individual only migrates with a
specific probability (migration probability). To migrate an individual, the original
IP address is removed from it. After this, another worker’s IP address is attached to it
before it is written into the space. This IP address is sampled with a uniform
probability from the other workers’ IP addresses.
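
The counter and migration rules described above can be sketched as follows. The dictionary fields `counter`, `address` and `finished`, the function names, and the IP addresses in the usage example are all hypothetical; the sketch only illustrates the decision made after an individual has passed through a worker's evolutionary algorithm.

```python
import random


def choose_destination(own_address, worker_addresses, migration_probability, rng):
    """Return the address to attach before the individual is written back."""
    if rng.random() < migration_probability:
        others = [a for a in worker_addresses if a != own_address]
        if others:
            # Uniform sampling among the other workers' addresses.
            return rng.choice(others)
    return own_address


def after_generation(individual, own_address, worker_addresses,
                     migration_probability, max_generations, rng):
    """Update a copy of the individual after one pass through a worker's EA."""
    updated = dict(individual)
    updated["counter"] += 1
    if updated["counter"] >= max_generations:
        # The indicator: only the master may take this individual now.
        updated["finished"] = True
        updated["address"] = None
    else:
        updated["address"] = choose_destination(
            own_address, worker_addresses, migration_probability, rng)
    return updated


rng = random.Random(0)
workers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
individual = {"counter": 0, "address": "10.0.0.1", "finished": False}
print(after_generation(individual, "10.0.0.1", workers, 0.1, 5, rng))
```

Restricting `worker_addresses` to a subset of the workers is one way to impose a particular migration topology without changing the rest of the logic.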

An interesting property of this implementation of a coarse-grained PEA is that its
behaviour only depends on two parameters: the migration probability and the number of
workers. The approach is also independent of the topology because all workers are
virtually interconnected via the space³. This makes the implementation much simpler
than other approaches. The migration probability controls the communication costs. If
a low migration probability is chosen, the approach can be expected to be faster than
the synchronous master-slave approach.



² The sufficient number of individuals depends upon the number of individuals that are
necessary for the selection process. In this case binary tournament was deployed, which
only requires two individuals.

³ Different topologies can be implemented by only supplying each EA with particular
IP addresses.

4 Methodology and Results

To investigate the scalability of the synchronous master-slave PEA (parallel implemen- 
tation), it is first compared against a serial implementation, which performs the fitness 
evaluation locally. After this, the proposed parallel implementations are compared with 
each other, utilising different numbers of workers. 

The master and the serial implementation were executed on a PC with the operating
system Windows 2000, a single Intel Pentium 4 2.6 GHz processor, and 1 GB memory.
The workers were executed on PCs with the operating system Windows 2000, a single
Intel Celeron 2 GHz processor, and 256 MB memory. The JavaSpace and the web server
ran on one PC running the operating system Linux (RedHat 9.0). This machine had
a single AMD Athlon 1.3 GHz processor and 512 MB memory. The latest JavaSpaces
version that comes with Jini 2.0 was used⁴. All machines were connected via a 100
Mbit switched network.

For the first experiments several synthetic data sets were created, each containing
between 1,000 and 10,000 samples. A population of 100 individuals was evaluated 10
times on each data set. The average evaluation times and standard deviations were
computed. Figure 2 depicts the results of the first experiment.




Fig. 2. Comparison between the serial implementation (dotted line) and the JavaSpace
implementation (solid line) utilising five slave computers.

It can clearly be seen that the parallel implementation outperforms the serial
implementation for data containing more than 1,000 samples. Data sets of this size are
not uncommon nowadays. Figure 3 compares several synchronous master-slave PEAs,
which utilise different numbers of workers.

⁴ It can be downloaded free of charge from
http://wwws.sun.com/software/communitysource/jini/download.html

Fig. 3. Comparison between several parallel implementations utilising different numbers of
workers (4 workers - solid line, 8 workers - dashed line, 16 workers - dotted line).



Here a data set can contain between 10,000 and 200,000 samples. It can clearly be seen
that the utilisation of more workers results in a faster evaluation time for these data
sets. Figure 4 depicts the achieved ratios of the decrease in evaluation time. The
‘4 workers’ parallel implementation is compared with the ‘8 workers’ and ‘16 workers’
parallel implementations, whereas the ‘8 workers’ parallel implementation is compared
with the ‘16 workers’ implementation. Ratios of the decrease in evaluation time of 2, 4,
and 2 are expected respectively.

Figure 4 shows that the deployment of 8 workers instead of 4 results in a decrease in
evaluation time by a factor of 2. One would expect the same when 16 instead of 8 workers
are deployed. Unfortunately, this is not the case. The reason for this might be extra
communication costs due to the larger number of workers. This is also emphasised by the
fact that the deployment of 16 workers instead of 4 workers only results in a speedup
of about 3.5.
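
The expected and observed ratios can be related by a short calculation. The sketch below assumes that the ideal ratio equals the ratio of the worker counts, as stated above; the function names and the evaluation times in the usage example are hypothetical.

```python
def speedup(time_fewer_workers, time_more_workers):
    """Ratio of the decrease in evaluation time."""
    return time_fewer_workers / time_more_workers


def parallel_efficiency(observed_speedup, fewer_workers, more_workers):
    """Observed speedup as a fraction of the ideal one (ratio of worker counts)."""
    ideal_speedup = more_workers / fewer_workers
    return observed_speedup / ideal_speedup


# Hypothetical evaluation times in seconds for 4 and 16 workers.
print(speedup(70.0, 20.0))

# Speedup of about 3.5 reported for 16 instead of 4 workers (ideal: 4).
print(parallel_efficiency(3.5, 4, 16))
```

A speedup of 3.5 against an ideal of 4 corresponds to an efficiency of 0.875, so roughly one eighth of the added capacity is lost, which is consistent with the suggested communication overhead.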

Three experiments were performed to compare the three PEAs. For each experiment
a PEA was run 100 times. For each run a data set containing N rows was produced.
The value for N was determined randomly through uniform sampling from the interval
[6000, 200000]. An EA run terminated after the selection process generated 1000
individuals. This termination criterion enabled a comparison of generational and
non-generational approaches. This termination criterion also works for the coarse-grained
PEA, although it has several selection processes. As each individual is equipped with
a counter, it enables one to count how many individuals all selection processes have
produced globally.

Individuals had a fixed size to keep the communication costs constant. A migration
probability of 0.1 was used for the coarse-grained PEA. This means that (on average)
every tenth individual is migrated to another EA.

Fig. 4. Ratios of the decrease in evaluation time (8 vs 16 workers - dotted line, 4 vs 8 workers -
solid line, 4 vs 16 workers - dashed line).



The three experiments utilised the following hardware and software settings. All
workers (EAs in the case of the coarse-grained PEA) were executed on PCs with the
operating system Linux, a single AMD Athlon 1.3 GHz processor and 512 MB memory.
The master implementations were executed on a PC with the operating system
Windows 2000, a single AMD Athlon 850 MHz processor and 256 MB memory.
The JavaSpace was executed on a machine with the same specification as that of the
workers. The JavaSpace implementation that comes with Jini 1.4 was used. The utilised
web server was an HP RP 2400 machine, with one PA 8500 RISC 440 MHz processor,
630 MB memory, Apache 1.3.6 and the HP-UX 11.00 operating system. All
computers were connected via a 100 Mbit switched network. It is important to note
that other users could have used the machines (except the master) while the experiments
were performed. Therefore the results show that the approach can be used in a
multi-user environment.

As stated above, an experiment consisted of 100 pairs of independent values (number
of cases) and dependent values (execution time in seconds). In order to compare
experiments visually, the statistical package SAS⁵ was used. For each experiment a
regression line was fitted through the data. A 99% confidence interval was
computed for the mean predicted values for each experiment. Figure 5 shows the results
when 20 workers are deployed.

It can clearly be seen that the coarse-grained PEA outperforms the synchronous 
master-slave PEA for more than 3,000 rows. The coarse-grained PEA outperforms the 
asynchronous master-slave PEA for more than 100,000 rows. However, the spread of 
data points also indicates that the execution time of the coarse-grained PEA can vary. 

5 SAS is a trademark of SAS Institute Inc. 
Regression Equation:

Time(Group 1) = 9.183998 + 0.000[illegible]*Cases + [illegible].051E-9*Cases^2

Time(Group 2) = 2.99026 + 0.00037[illegible]*Cases + 8.32E-10*Cases^2

Time(Group 3) = -0.06916 + 0.000372*Cases + 5.9E-10*Cases^2

Fig. 5. Comparison between different PEAs (Group 1: 20-worker synchronous master-slave PEA,
Group 2: 20-worker asynchronous master-slave PEA, Group 3: 20-worker coarse-grained PEA).



5 Conclusions and Future Work 

This paper has shown that the JavaSpaces technology can be used to improve the per- 
formance of evolutionary algorithms in a simple and affordable manner. Several ap- 
proaches for the implementation of different parallel evolutionary algorithms were pro- 
posed and successfully tested using this technology. It was also shown that the PEA 
implementations perform well in multi-user heterogeneous computer networks. As expected,
the coarse-grained PEA outperformed the synchronous and asynchronous master-slave
PEAs. This is most probably due to its low communication costs. The coarse-grained
PEA is capable of evolving the same number of individuals in a shorter period
of time than the other PEAs. However, the question of whether or not the individuals
that were evolved by the coarse-grained PEA perform as well as those evolved
by the other PEAs has not been resolved. Further studies are required to investigate this 
issue. 
Abstract. In many of today’s domains data is of vital importance. It helps give
an indication of what has occurred and why. In some cases this is obvious;
in others it is not. The process of extracting useful knowledge
from data can be enhanced by using techniques such as machine learning and
data mining. Given the speed and ease of these computerised techniques, it
becomes increasingly beneficial to store and analyse data which may contain vital
knowledge. Medical domains have been using a variety of machine learning
tools for some time, for example classification trees and neural networks.
These allow an algorithm to analyse data in relation to a concept and to learn and
present knowledge specific to that concept. However, the next challenge in this
area is not only to automate the learning of these concepts, but to adapt to
changes in them, i.e. concept drift. These changes can occur over time (e.g. different
stages of disease progression) or may be triggered by some controlled change
(e.g. changes in a drug being used). The changes can be immediate (revolutionary) or
occur over a period of time (evolutionary). Changes in data with automated learning
can now be handled by the algorithm CD3. CD3 will detect drift, highlight
specifically which sections of the knowledge have changed/drifted and remove the
examples that are no longer valid. This is all done automatically without any
intervention from the user.



1 Introduction 

A common inference task consists of making discrete predictions about a concept, for 
example a diagnosis. This prediction problem is referred to as the classification prob- 
lem. The task of a classification algorithm is to accept a set of training examples 
which will depict the current state of knowledge for that concept. These training ex- 
amples are a set of descriptive attributes with an associated class. This class repre- 
sents a value for the concept. The training examples can be the combination of data 
from patient records and expert knowledge. For example it is usually the expert who 
allocates the concept class for each patient record, i.e. if at disease level 1 or 2. The 
algorithm will induce a knowledge structure to distinguish between the values of the 
concept. A tree induction algorithm will produce a classifier in the form of a tree from
which rules can be read, one for each path from the root of the tree to a
leaf. These rules are the knowledge/hypothesis which represents the concept.
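
The reading of rules from a tree, one per root-to-leaf path, can be sketched as follows. This is only an illustration under stated assumptions: the tuple layout of the tree, the function name extract_rules and the branches of the toy tree are hypothetical, and only the thresholds 82.46 and 2.71 echo the rule quoted in Section 2.

```python
# Hypothetical toy tree: an internal node is (attribute, threshold, left, right)
# and a leaf is just a class label. The branch layout is made up for illustration.
def extract_rules(node, conditions=()):
    """Return one (conditions, class) rule per root-to-leaf path."""
    if not isinstance(node, tuple):          # leaf: emit the accumulated path
        return [(list(conditions), node)]
    attribute, threshold, left, right = node
    rules = []
    rules += extract_rules(left, conditions + (f"{attribute} =< {threshold}",))
    rules += extract_rules(right, conditions + (f"{attribute} > {threshold}",))
    return rules


toy_tree = ("Tot ptn", 82.46,
            ("Tchol", 2.71, 1, 2),
            2)

rules = extract_rules(toy_tree)
for conds, label in rules:
    print(" and ".join(conds), "->", label)
```

Each printed line corresponds to one path through the toy tree, so a tree with three leaves yields three rules.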

This induced structure can then be used to classify/predict new unseen examples
where the value of the concept is not known. For example the induced structure can
be used for diagnosis, identification of risk factors [3,13,14], prediction of signal peptides
[18], or any form of discriminatory function. It can even be used just to generate a
hypothesis about the concept to ensure nothing of information value has been left
unidentified [11,15,18]. This is particularly important when there are many features and/or
many records.



2 Examples of Classification with Clinical Data 

A lot of work has already been carried out in the medical domain using machine 
learning techniques [5,13,14,15,18,21]. Work carried out by the authors [3] gives an
insight into how classification trees can be used to find interesting knowledge
in data. The work was based on data from the Bioinformatics Centre at the University
of Ulster, Coleraine. The data records relate to adult cystic fibrosis (CF) sufferers in
Northern Ireland. Cystic fibrosis (CF) occurs in the European Caucasian population at 
a rate of approximately 1 in 2000. These patients suffer from chronic infection. This 
chronic disease state usually leads to decreased lung function, poor nutritional status, 
elevated immune function and raised oxidative stress. The type of bacterium that the 
host is infected with seems to affect morbidity and mortality. Infection of CF patients 
by Burkholderia cepacia (BC) has been shown to lead to clinical decline. The occur- 
rence of BC has increased greatly since it was first isolated in CF patients. Approxi- 
mately 30% of adult CF patients in Northern Ireland are infected. The aim of this 
study was to examine markers of three specific areas: oxidative stress, nutrition and 
disease progression in patients with and without B. cepacia infection. Fourteen CF 
patients with B. cepacia (range 16-30 yr, mean 24.9) and twenty-six non-B. cepacia 
CF patients (range 16-38 yr, mean 26) who were attending the Belfast City Hospital 
Adult Cystic Fibrosis Unit were enrolled for the study. The three specific areas of 
interest were measured by the attributes shown in [3]. The initial objective was to 
induce any correlation between the patient attributes available and the disease level of
the patient, i.e. cepacia or non-cepacia. As no specific initial hypothesis had been
identified, this was a clear induction task. The aim was to induce some information
from the attributes that helps to determine the disease level.

The attributes that were selected for the learning process are shown below. 

Class: C/non-C: 1 = cepacia, 2 = non-cepacia
Sex: 1 = male, 2 = female
Age: continuous



Table 1. Cepacia Data Fields

1 Nutrition                       2 Morbidity
  Retinol                           Severity Scale
  Lutein                            Av. ACT
  γ-Tocopherol                      NE/AAT
  α-Tocopherol
  β-Carotene
  Total protein (Tot ptn)         3 Oxidative stress
  Total cholesterol (Tchol)         Protein Thiols
  γ-Tocopherol:cholesterol
  α-Tocopherol:cholesterol

The tree induction learning process searched among possible hypotheses for those that
explained the training instances sufficiently well in relation to the class. Induction
of decision trees is one of the most widely used approaches and presents the induced
classifier as a tree representation [19].




A rule is deduced from the tree by reading from the root to a leaf; the class at
that leaf is the class of the rule. The first rule, Tot ptn =< 82.46 and Tchol =< 2.71 -> 1,
can be read as: 'If a patient's total protein is less than or equal to 82.46 and the total
cholesterol is less than or equal to 2.71 then this patient is at disease level 1'.

Inductive classifiers have been applied successfully to medical data in these
analyses. We can suggest from these results that the measurement
of total protein is not acting as an indicator of nutrition but rather as an indicator of
disease state. Total protein would increase in patients with an infectious load owing to
raised acute phase proteins including ACT.

Our results for total cholesterol are in keeping with the findings of Corey [4] whose 
landmark paper highlighted the importance of dietary fat intake in the survival of CF 
patients. Without an adequate calorific intake it is probably more difficult for CF 
patients to recover from, or reduce the occurrence of, acute exacerbations. The intake 
of dietary fats is also important for absorption of fat-soluble vitamins (FSVs), which are known to have
antioxidant properties. Depressed antioxidant status contributes to increased oxidative 
stress in CF patients [16]. This data analysis technique has also identified ACT and 
age as important factors in both the BC and non-BC subgroups of CF patients. 

There are, however, many complications with data from any domain. With medical data
there may be too much data or too little, and/or too many or too few features
[3, 17]. Noise in the data can be an added complication [9]. Another key
problem is data validation [10]. However, these are not the focus of this paper.

The experiment discussed above used data from a stationary distribution. However,
this is not always the case. Many medical datasets are recorded incrementally over
months or years. The following section discusses a controlled experiment where
batches of new updated records are gathered over the lifetime of the experiment, and
the effect this can have on the knowledge/hypothesis induced.
3 Incremental Data

In the example discussed above there existed one batch of examples from which the
algorithm could extract its knowledge/hypothesis about the concept. Many of today’s
domains have a stream of batches of data generated regularly. An example is a
controlled experiment where a series of tests is carried out on patients at a number of set
intervals. Between intervals some parameter within the experiment is adjusted
and the effect of this on the concept must be identified.




Fig. 2. Sequence of Incremental data along a time line 



The rules which were induced using Batch 1 as training data can be used as a
classifier for unseen examples. However, these rules will only remain valid as
long as conditions within the trial remain unchanged. Change could be due
to time (e.g. disease progression in the patients) or to some experimental parameter
being altered within the trial [13]. Once a change has occurred and its results are
recorded, the rules generated from Batch 1 alone may no longer be valid. Some of the
rules which represent the concept may have changed, i.e. concept drift has occurred
(this is discussed in detail in the next section). Therefore, once the next batch of new
records, Batch 2, is available it must be incorporated into the initial data, and any changes
that have occurred must be highlighted and incorporated so that the classifier remains
up to date and valid. An out-of-date classifier could be catastrophic [1]. The CD3
algorithm provides this mechanism. It will accept any number of batches
along the lifeline of a trial. As each batch arrives the induction process is updated and
a new tree is generated. It will also highlight any changes in the knowledge between
batches via a special methodology [1,8]. This is then used to purge out-of-date data,
and the classifier is updated to include only valid rules about the concept.

The effects of batches and the changes that can occur between or within them are
discussed in detail in [1].



4 Concept Drift 

In today’s world nothing remains static. The world around us is adapting/evolving all
the time. This is very evident in the field of biomedical science [14]. In many application
areas where databases are being mined for classification rules, it is reasonable to
suppose that the underlying rules will be subject to concept drift [1,2,8,22,23] due to
this adaptation or evolution. By concept drift we mean that some or all of the rules
defining a concept change as a function of time.

A simple hypothetical example of a domain where this could occur is a controlled
experiment in which the experimenter follows a rigid test plan of changes to features,
and their effect on the class outcome (the disease state) is recorded. Here
change may be prompted by a change in the level of an individual nutritional mineral
(or a combination of them). For example, disease progression in current patients may
differ from that in past patients because of changes to features made in an experimental trial.

To see how drift can affect rules, consider hypothetical rules from the cystic fibro- 
sis example discussed above. Suppose that as part of an incremental experimental trial 
nutritional levels were being increased. The classes remain the same as before: 

class 1: likely to be in disease state 1

class 2: likely to be in disease state 2

A rule from the induced tree classifier after batch 1 could read as follows:

if Totptn =< 82.46 and Tcholage =< 2.71
then class 1

If, after that batch, nutritional parameters changed for some patients, a drifted rule
could read as follows:

if Totptn =< 82.46 and Tcholage =< 2.71 and
Plasma fat-soluble vitamins > 0.5
then class 2

Drift in rules may affect only part of the knowledge induced. For example, in the case of
the controlled experiment, there may be different hypotheses for the same disease state
(different branches on the tree for the same class), some of which remain more or less
static over time whilst others are highly dynamic and affected by drift.

Drift may happen suddenly, referred to as revolutionary, or may happen gradually
over an extended time period, referred to as evolutionary [1]. In the former case we
refer to the time at which drift occurred as the drift point. We can regard evolutionary
drift as involving a series of separate drift points with very small drift occurring at
each one.

The extent of prior knowledge about when drift is likely to occur depends very
much on the application. For example, in the controlled experiment a slight change to
one parameter may produce a very slow reaction, and this can be anticipated.
The drift point will probably occur shortly after the change in the parameter. Other
examples may follow an operation, or precede one, as with coronary
heart disease [14], where the change may be more immediate. In contrast there will be
domains in which it is not clear, even retrospectively, when drift might be present.

In spite of the above arguments, there is, in much of the current work in machine
learning, a tacit assumption that the knowledge being mined is static [14]. Yet often
the data source for induction has been collected over a considerable period of time,
making it more probable that drift has occurred on some occasion(s). In machine
learning there has been some work for more than a decade on concept drift for
classification; see, for example, [7,20,22,23].

The solution we proposed in [1] was to augment the data records with a time stamp
and to actively use this in the induction process as a descriptive attribute. Drift is then
indicated by the time stamp attribute becoming a relevant attribute in the knowledge
structure acquired during induction. We refer to this principle as TSAR (Time Stamp
Attribute Relevance). Here we refer to any induction procedure that seeks to detect
concept drift by this means as a TSAR procedure or algorithm. The simplest way to
create such a procedure for induction is to augment an existing classification learning
algorithm, referred to as the base learner.

Before a TSAR procedure can be applied, the granularity of the time stamp must
be decided upon. In [1] we considered a time stamp associated with a new batch, i.e.
within the batch all time stamps were identical. We proposed an incremental concept
drift detecting learning system, CD3 (CD = concept drift), which used ID3 together
with a post-pruning algorithm as its base learner (see figure 1). Only the time stamps
'current' and 'new' were used. CD3 maintains a database of valid examples from the
totality of data presented to the system in its lifetime. Each example in this current
example set carries the time stamp value 'current'. When a new batch of data arrives,
its examples are all time stamped 'new'. CD3 then learns classification rules from the
combined current example set and the new batch using ID3 together with
post-pruning.

Induction with the TSAR approach produces a knowledge structure in which
individual rules may be tagged with, i.e. have as part of their antecedent, a time stamp
'current', a time stamp 'new' or no stamp at all. Any rule having a time stamp value of
'current' must be out of date since here 'current' refers to the situation before the
present round of mining. Such rules are called invalid while the remaining rules are
valid. The final stage in a CD3 cycle involves updating the database of current
examples by first purging it of those examples that match invalid rules and then adding the
new batch to it after re-labelling its examples as current. This process is explained at
greater length in [1].
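
The final stage of a CD3 cycle described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the dictionary layout of examples and rules, the function names, and the decision to treat an example as matching a rule when it satisfies the rule's non-time-stamp conditions are all hypothetical. The toy values are made up and do not come from the study.

```python
def antecedent_holds(rule, example):
    """True if the example satisfies every non-time-stamp condition of the rule."""
    return all(test(example[attr]) for attr, test in rule["conditions"].items())


def cd3_update(current_examples, new_batch, rules):
    """Purge examples covered by invalid rules, then absorb the new batch."""
    # Rules tagged with time stamp 'current' describe the situation before
    # this round of mining and are therefore invalid.
    invalid = [r for r in rules if r["stamp"] == "current"]
    valid = [r for r in rules if r["stamp"] != "current"]
    # Purge current examples that match an invalid rule.
    kept = [ex for ex in current_examples
            if not any(antecedent_holds(r, ex) for r in invalid)]
    # Re-label the new batch as 'current' and add it to the database.
    kept += [dict(ex, stamp="current") for ex in new_batch]
    return kept, valid


# Toy data (made up for illustration).
current = [
    {"Totptn": 80.0, "class": 1, "stamp": "current"},
    {"Totptn": 90.0, "class": 2, "stamp": "current"},
]
new_batch = [{"Totptn": 81.0, "class": 2, "stamp": "new"}]
rules = [
    {"stamp": "current", "conditions": {"Totptn": lambda v: v <= 82.46}, "class": 1},
    {"stamp": "new", "conditions": {"Totptn": lambda v: v <= 82.46}, "class": 2},
    {"stamp": None, "conditions": {"Totptn": lambda v: v > 82.46}, "class": 2},
]

database, valid_rules = cd3_update(current, new_batch, rules)
```

In this toy run the first current example is covered by the rule stamped 'current' and is purged, while the second survives and the new example joins the database with its stamp re-labelled.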

Valid rules are used for the on-line classification task until they are further updated 
after the next round of mining. 

We demonstrated in [1] that the CD3 algorithm could detect revolutionary and 
evolutionary patterns of drift against a substantial background of noise and recover 
swiftly to produce effective classification. This was in marked contrast to the use of 
ID3 where time stamps were not used. 

A more detailed discussion of refining the time stamps used, and of their effect on
the learning process, is given in [8]. In [2] some experiments were carried out on
telecommunication customer data which contained drift. The results show that
the tree successfully highlights when drift has occurred and which section of knowledge
is affected by drift.



5 Future Work 

Concept drift techniques have already been applied to a variety of domains such as telecoms
[8] and user profiling [6]. Some work has been carried out in clinical studies [14]. That
work reported on the complications that the user had in setting parameters in the
experiment and in model rebuilding. Some of the systems discussed by Kukar [14]
removed examples as the learning became unstable and then relearned the knowledge.
CD3 requires no intervention by a machine learning expert. The classifier can
be automatically updated with new examples and the induced knowledge structure



Detecting and Adapting to Concept Drift in Bioinformatics 167 



[Figure: diagram with boxes labelled New Batch, Valid Rules, Unclassified Examples, Classifier, Classified Examples]



will be updated if drift occurs. CD3 will not remove examples that are still valid. It
will only purge knowledge that has changed and become invalid. The examples that
represent this can be removed and stored for later analysis. CD3 proved very successful
with the drift in the telecoms data [2]; therefore the next challenge is to identify
biomedical areas which have data that may be susceptible to change and to employ CD3
as a learner. The possible drifting concepts must be identified and the relevant data
analysed in preparation for the concept drift study. Once the classes have been
identified and the data collated and time stamped, CD3 can then detect and present any
findings. The nature and speed of the change will be made apparent by the algorithm
[1]. Because changes are detected and purged it is possible to store these changes as
they occur and carry out trend analysis over time. This can become very important
in controlled experiments where batches of data are being generated with
slight changes to parameters. Changes (or lack of them) with respect to the class(es) can
be recorded against the controlled change.

Another area of interest within medical data would be to analyse patient records
from specific geographical regions in relation to a disease. CD3 could analyse
batches of patient records, each batch coming from a different geographical region. The
algorithm could then be used to indicate whether patients from different regions give rise to
different knowledge about the disease by highlighting concept drift.
Abstract. The auditory brainstem response (ABR) has become a routine clinical
tool for hearing and neurological assessment. In order to pick out the ABR
from the background EEG activity that obscures it, stimulus-synchronized averaging
of many repeated trials is necessary and typically requires up to 2000
repetitions. This number of repetitions can be very difficult and uncomfortable
for some subjects. In this study a method based on wavelet analysis is
introduced to reduce the required number of repetitions. The important features of
the ABR are extracted by thresholding and matching the wavelet coefficients.
The rules for the detection of the ABR peaks are obtained from the training data
and the classification is carried out after a suitable threshold is chosen. This
approach is also validated on a further three sets of test data. Moreover, two
procedures based on Woody averaging and latency correlated averaging are used to
preprocess the ABR, which enhances the classification results.



1 Introduction 

The electroencephalogram (EEG) is a plot of the electrical potentials within the brain,
which can be recorded from electrodes placed on the scalp of the subject. It can be
influenced by sensory stimuli. The auditory evoked potentials (AEPs) are the electrical
responses related to the auditory stimulus [1]. The auditory brainstem response
(ABR) is the early component of the overall AEP and occurs within 10-20 ms of the
stimulus. Since arousal and attention, drowsiness, or the effects of drugs do not affect
the ABR, it is useful as a hearing screening method, particularly for infants and
newborns.

The ABR includes waves I-VII, of which I-V are usually investigated, V often being
the most prominent one. An example of the ABR is shown in Figure 1, where
waves I, II, III and V are shown as labelled peaks. In this example wave IV is not
evident. There are some factors that can affect the ABR, such as stimulus parameters
and subject effects [2]. Stimulus intensity is one important factor that has an effect on
the response. In general, a decrease in stimulus intensity is associated with an increase
in ABR wave latencies and a corresponding decrease in amplitude. Gender is one of the
subject effects, with females tending to have shorter latency and larger amplitude than
males. All these factors must be taken into consideration when interpreting the ABR.




Fig. 1. An example of the ABR obtained from a healthy adult with wave I, wave II, wave III
and wave V marked. The acoustic stimulus is triggered at time 0

The ABR has very small amplitudes, and the background EEG activity, which can
be 10 times greater, largely obscures its characteristic features. In order to pick out the
ABR from the ongoing background EEG activity, stimulus-synchronized averaging of
many repeated trials is necessary. Since the ABR tends to occur at almost the same
time after the stimulus on each trial, the shape of the waves is enhanced when
many trials are averaged together. In contrast, the background EEG waves are
unrelated to the stimulus, so they are suppressed by averaging. This conventional
averaging procedure typically requires up to 2000 repetitions, which can be
very time consuming and uncomfortable for some subjects. Therefore, reducing the
required number of trials, and hence the duration of the test, is important for both
clinicians and patients, especially for children and non-cooperative adults.

Conventional averaging assumes that the ABR repeats exactly for each stimulus
application. However, it is known that changes in the ABR occur from one
stimulus application to the next, such as variation in the latencies of waves [3]. If
the changes of latency can be determined, the temporal position of each trial can be
adjusted in order to best align it with a peak in the average ABR. This procedure
should provide a substantial improvement in signal to noise ratio (SNR), and
consequently reduce the number of trials needed to provide a clear response. Woody [4]
devised a method for measuring the waveform of variable-latency EPs using this type
of preprocessing. It cross-correlates each trial with an appropriate template and
determines the time displacement for which the cross-correlation is a maximum. Each trial
is then shifted in latency and an average of the aligned waveforms is computed. The
process is repeated using this new average as the template and a second average is
computed. The template is then continually adapted as more trials are processed until
the average cross-correlation between the current average and the data set does not change
significantly from that of the previous one.
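
The alignment loop described above can be sketched in a few lines. This is an illustrative sketch only, with several assumptions that are not in the source: the function names, the zero-padded shift, the search range max_lag, and a fixed number of iterations standing in for the convergence test on the average cross-correlation.

```python
def shift(trial, lag):
    """Shift a trial by `lag` samples (positive = later), zero-padding the gap."""
    n = len(trial)
    return [trial[i - lag] if 0 <= i - lag < n else 0.0 for i in range(n)]


def best_lag(trial, template, max_lag):
    """Lag for which the cross-correlation of trial and template is a maximum."""
    def xcorr(lag):
        return sum(a * b for a, b in zip(shift(trial, lag), template))
    return max(range(-max_lag, max_lag + 1), key=xcorr)


def average(trials):
    """Conventional point-by-point average of equal-length trials."""
    return [sum(column) / len(trials) for column in zip(*trials)]


def woody_average(trials, max_lag=5, iterations=3):
    template = average(trials)            # start from the conventional average
    for _ in range(iterations):           # fixed count replaces the convergence test
        aligned = [shift(t, best_lag(t, template, max_lag)) for t in trials]
        template = average(aligned)       # new average becomes the next template
    return template


def peak_at(position, n=20):
    """Toy trial: a single unit peak at the given sample (made-up data)."""
    return [1.0 if i == position else 0.0 for i in range(n)]


trials = [peak_at(9), peak_at(10), peak_at(11), peak_at(10)]
plain = average(trials)
woody = woody_average(trials)
```

With these toy trials the conventional average smears the peak across three samples, whereas the aligned average restores a single full-height peak, which is the improvement in SNR the procedure aims for.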

Woody averaging has several limitations because it assumes that the entire signal
shifts in latency as a function of stimulus application. If different components within
each trial have latency changes that are independent of each other, as appears to be
true in most practical situations, neither conventional averaging nor Woody averaging
will give an optimum representation of the signal. In order to overcome this shortcoming
of Woody averaging, latency correlated averaging was proposed, which allows for the
latencies of the peaks of components in the same waveform to vary independently of
each other [5]. Aunon [6] gave a comparison of these three averaging techniques for
processing EPs.

Wavelet analysis is a signal processing tool which has been developed for multi- 
scale representation and analysis of signals [7]. Wavelet analysis offers the ability to 
break the signal down into its component scales (frequency ranges). One area of its 
applications is biomedical engineering. Unser et al. [8] have given a general discus- 
sion of the use of wavelets in biomedical applications, and the potential benefit of 
wavelet analysis of the AEP has also been reflected by a number of reports. The dis- 
crete wavelet transform (DWT) is the most widely used WT algorithm for multi- 
resolution AEP analysis in the literature because of its computational efficiency, and 
it provides a sparse time-frequency representation of the original signal that has the 
same number of samples as the original. The first application of the wavelet transform 
to brain signal analysis, especially for the analysis of evoked potentials, was described
by Bartnik et al. [9]. Samar et al. [10] also presented the general concepts
underlying wavelet analysis of the event-related potentials and designed the matched 
Meyer wavelet which identified the prominent IV-V complex of the ABR used widely 
for clinical evaluation of hearing loss [11]. Other examples include improved wavelet 
morphology and the detection of auditory brainstem responses [12, 13], late auditory 
evoked potentials [14], and auditory P300 event-related potentials [15, 16]. These 
wavelet methods differ from the traditional Fourier techniques by the way in which 
they localize the information in the time-frequency plane. Since the wavelet analysis 
has a varying window size, being wide for low frequencies and narrow for the high 
ones, it leads to an optimal time-frequency resolution in all the frequency ranges. 

Although the wavelet transform has been used extensively for EP/AEP analysis,
very little work has been done on the ABR and not much on the classification of the
ABR or the reduction of the required repetitions. This paper develops an approach
to feature extraction based on wavelet analysis and to classification of the ABR using
fewer repetitions. In addition, in order to deal with the variance in the latency of the
ABR, two methods, based on Woody averaging and latency correlated averaging, are
used to preprocess the ABR.



2 Wavelet Based Classification 

2.1 The ABR Data 

The ABR is recorded in an acoustically quiet room and the subject either reclines in a 
comfortable chair or lies on a bed with electrodes placed on the surface of the scalp. 
In order to measure the ABR, a series of acoustic stimuli is presented to the ear by the 
sensory stimulator. In this study one set of data from a healthy female adult is used as 
the training data to get the classification rules and the classification thresholds. Then 
another three sets of test data from three healthy adults, patient 1 (female), patient 2 



172 Rui Zhang et al. 



(male) and patient 3 (male), are used to validate this approach. The stimulus intensity
is 70 dB and the ABR is recorded from 10 ms before the stimulus to 10 ms
after the stimulus, with 200 data points sampled in the pre-stimulus period and 200
in the post-stimulus period. It is assumed that for normal hearing subjects a response is
present post-stimulus whenever an acoustic stimulus has been presented. However, in order to
collect data with a missing response, recordings are also made with no stimulus at
t=0. Thus the complete set includes response and no-response data in the
post-stimulus period. Averages of 64 and of 128 repetitions are used for the investigation
of both the pre-stimulus and post-stimulus activity.

2.2 Applying the Wavelet Transform to the ABR 

The wavelet transform (WT) gives a time-frequency representation of a signal that
has many advantages over previous methods, such as an optimal resolution in both the
time and frequency domains and no requirement for the signal to be stationary.
A number of wavelet functions exist, each having different characteristics. The
choice of optimal wavelet is open to debate. Wilson [17] suggested utilizing a smooth and
symmetrical mother wavelet, and in this study the biorthogonal 5.5 (Bior 5.5) wavelet is
chosen as the mother function of the WT and a 5-level DWT is performed using
Matlab. Thus 5 detail scales (D1-D5) and a final approximation (A5), with different
numbers of wavelet coefficients at each scale, are obtained. The lower scales give the
details corresponding to the high frequency components of the signal and the higher
scales those corresponding to the low frequency ones. Figure 2 is an example of the ABR
and its five-level decomposition using the discrete wavelet transform.




Fig. 2. An example of the ABR and its 5-level decomposition. The ABR is decomposed
into a low frequency component (approximation A5) and 5 higher frequency
components (details D1, D2, D3, D4 and D5)
2.3 Feature Extraction

Thresholding the Wavelet Coefficients. As a large number of wavelet coefficients 
of varying amplitudes are obtained at all scales, this raises the issue of wavelet 
coefficient selection: which coefficients to keep and which to eliminate. Quiroga 
[15, 16] has proposed a method that chooses the wavelet coefficients covering the 
time range in which the EP is expected to occur. Generally the larger wavelet 
coefficients are caused by the responses to the stimulus and are useful for the 
classification of the ABR, while the smaller ones are caused by the background EEG 
activity and are less important. Therefore the wavelet coefficients are thresholded at 
each scale so that only the larger ones are kept. Since the final approximation (A5) 
provides the base of the ABR, all the A5 wavelet coefficients are kept, and for the 
5 detail scales (D1-D5) the largest 20% of the wavelet coefficients are retained. 
After thresholding the wavelet coefficients at each scale, the background EEG activity 
is greatly suppressed. Although the larger wavelet coefficients may be retained as a 
result of the stimulus, they can also be produced by excessive background EEG noise; 
therefore a further matching procedure is then performed. 
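
The thresholding step can be sketched in a few lines of Python. This is only an illustrative sketch, not the Matlab code used in the study: the function names are hypothetical, "larger" is assumed to mean larger in absolute value, and the coefficients of each scale are assumed to be given as plain lists.

```python
def threshold_scale(coeffs, keep_fraction=0.2):
    """Keep the largest-magnitude coefficients of one scale and zero the rest.

    Returns (thresholded coefficients, sorted indices of retained coefficients).
    Assumption: 'larger' means larger absolute amplitude.
    """
    n_keep = max(1, round(len(coeffs) * keep_fraction))
    # Indices ordered by decreasing absolute amplitude.
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = sorted(order[:n_keep])
    kept_set = set(kept)
    thresholded = [c if i in kept_set else 0.0 for i, c in enumerate(coeffs)]
    return thresholded, kept


def threshold_decomposition(details, approximation, keep_fraction=0.2):
    """details: dict such as {"D1": [...], ..., "D5": [...]}; approximation: A5 list."""
    thresholded, locations = {}, {}
    for name, coeffs in details.items():
        thresholded[name], locations[name] = threshold_scale(coeffs, keep_fraction)
    # All A5 coefficients are kept, as described in the text.
    thresholded["A5"] = list(approximation)
    locations["A5"] = list(range(len(approximation)))
    return thresholded, locations
```

The returned locations are indices within each scale; mapping them to data points of the ABR depends on the decomposition and is not shown here.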

Matching the Wavelet Coefficients of the ABR to the Template. The above 
thresholding procedure is applied to obtain the thresholded wavelet coefficients and 
their locations for each of the 64 and 128 averaged ABRs as well as for the grand 
averaged ABR of the training data. As mentioned, the grand averaged ABR has a better 
SNR owing to conventional averaging, so its thresholded wavelet coefficients, which are 
retained using the previous method, are almost certainly caused by the response to the 
stimulus. The grand averaged ABR of the training data is therefore used as the template, 
and the locations of its thresholded wavelet coefficients are compared with those of the 
64 and 128 averaged ABRs at the different scales. Each thresholded wavelet coefficient of 
a 64 or 128 averaged ABR is retained if it matches the template, in other 
words, if the template also has a thresholded wavelet coefficient at the same location. If 
not, this thresholded wavelet coefficient is possibly caused by the background EEG 
noise and should be eliminated. 
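
A minimal sketch of this matching step is given below, assuming that the retained locations of each ABR and of the template are held in dictionaries keyed by scale name (a hypothetical data layout chosen for illustration).

```python
def match_to_template(abr_locations, template_locations):
    """Keep, per scale, only the locations that the template also retained.

    Both arguments map a scale name (e.g. "D4") to a list of retained locations.
    """
    matched = {}
    for scale, locs in abr_locations.items():
        template = set(template_locations.get(scale, ()))
        matched[scale] = [loc for loc in locs if loc in template]
    return matched
```

Locations that survive this step are the ones passed on to the rule extraction described next.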

After matching the wavelet coefficients, a further investigation examines whether there 
are common locations of the thresholded wavelet coefficients among the ABRs 
with responses which do not exist among the ABRs with no response, so that the 
difference in nature between the two groups can be explained. 

2.4 Rule Extraction 

By comparing the common locations it is found that some locations appear frequently 
at certain scales among all the ABRs with responses, while this is infrequent among 
the ABRs with no response. Moreover these locations correspond to the well-known 
pattern for such ABR data, in which the terms wave I, wave II, wave III, wave IV 
and wave V are used to describe the pattern. Three locations in particular appear most 
frequently, and it is assumed that these indicate the approximate locations of waves I, 
III and V, since these three waves can be most easily found in the ABR if a response 
is present. After studying the training data, the locations that appear most frequently 
are found to be X1=22, X2=78 and X3=118, where X1, X2 and X3 are data points. Since 
the latencies of the ABRs differ from one to another, a time range in which the peaks 
might shift is considered. Y=16 is chosen as the half-length of the time range, since it 
is the minimum distance between the wavelet coefficients at scales D4, D5 and A5. 
Previous research on the ABR shows that its frequency content consists of 3 peaks at 
approximately 200, 540 and 980 Hz. Component 1 is the low frequency component at 
200 Hz or below, which is thought to contribute to wave III and wave V and the trough 
that follows wave V. Component 2 is a middle frequency component in the 500-600 Hz 
range, which is thought to contribute to waves I, III and V. Component 3 is the high 
frequency component in the 900-1100 Hz range, which is thought to contribute to 
waves I, III and V [18]. Therefore, for the detection of wave I, scale D4 (625-1250 Hz) 
and scale D5 (315-612 Hz) are searched, and for the detection of wave III and wave V, 
scale D4 (625-1250 Hz), scale D5 (315-612 Hz) and the final approximation A5 
(0-313 Hz) are searched. Three rules for the detection of waves I, III and V are then 
extracted, which are shown in Table 1. The more rules the ABR meets, the more 
probable it is that a response is present. 
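
The frequency ranges quoted for D4, D5 and A5 are close to the nominal dyadic bands of a 5-level DWT. The sketch below computes these nominal bands; the sampling rate of 20 kHz is an assumption inferred from the quoted ranges and is not stated in this section, and the function name is hypothetical.

```python
def dwt_bands(fs, levels):
    """Nominal dyadic frequency bands (Hz) of an ideal `levels`-level DWT."""
    bands = {}
    upper = fs / 2.0  # Nyquist frequency
    for level in range(1, levels + 1):
        lower = upper / 2.0
        bands[f"D{level}"] = (lower, upper)  # detail band at this level
        upper = lower
    bands[f"A{levels}"] = (0.0, upper)       # final approximation band
    return bands


# Assumed sampling rate (not given in the text): 20 kHz.
bands = dwt_bands(fs=20000.0, levels=5)
```

Under this assumption D4 spans 625-1250 Hz, D5 spans 312.5-625 Hz and A5 spans 0-312.5 Hz; real wavelet filters overlap, so these limits are only approximate.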

Table 1. Three rules for the detection of waves I, III and V at different scales 

        Conditions                                          Conclusions 

Rule1   If at (scale D4) or (scale D5)                      Wave I present. 
        the location for any of the thresholded wavelet 
        coefficients is in the interval [X1-Y, X1+Y] 

Rule2   If at (scale D4) or (scale D5) or (scale A5)        Wave III present. 
        the location for any of the thresholded wavelet 
        coefficients is in the interval [X2-Y, X2+Y] 

Rule3   If at (scale D4) or (scale D5) or (scale A5)        Wave V present. 
        the location for any of the thresholded wavelet 
        coefficients is in the interval [X3-Y, X3+Y] 



The three rules above are applied to the training data. Because two scales are 
searched for wave I, the rule score of rule1 is recorded out of 2; the rule scores of 
rule2 and rule3 are recorded out of 3 because three scales are searched for them. 
For example, for rule1, if a location appears in the interval [X1-Y, X1+Y] at both 
scales, the rule score is 2/2=1, but if it appears at only one scale, the score is 1/2. 
Then for each ABR, the three rule scores of rule1, rule2 and rule3 are added 
and the sum is used for the classification. 
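
The scoring just described can be sketched as follows. The values X1, X2, X3 and Y are taken from the text; the function names and the dictionary layout are hypothetical, and the locations are assumed to be already expressed as data points after thresholding and matching.

```python
X1, X2, X3 = 22, 78, 118  # most frequent locations found in the training data
Y = 16                    # half-length of the search interval

# Each rule: (interval centre, scales searched), following Table 1.
RULES = {
    "rule1": (X1, ("D4", "D5")),
    "rule2": (X2, ("D4", "D5", "A5")),
    "rule3": (X3, ("D4", "D5", "A5")),
}


def rule_scores(locations, half_width=Y):
    """locations: dict mapping scale name to retained locations (data points)."""
    scores = {}
    for name, (centre, scales) in RULES.items():
        # Count the scales with at least one location inside the interval.
        hits = sum(
            1
            for scale in scales
            if any(centre - half_width <= loc <= centre + half_width
                   for loc in locations.get(scale, ()))
        )
        scores[name] = hits / len(scales)
    return scores


def total_score(locations):
    """Sum of the three rule scores, used for the classification."""
    return sum(rule_scores(locations).values())
```

For instance, an ABR whose retained locations are 20 and 120 at D4 and 80 at D5 scores 1/2 for rule1 and 1/3 for each of rule2 and rule3.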



2.5 Learning the Classification Threshold 

After applying the rules, the scores of the ABRs with no response tend to be smaller 
than those of the ABRs with responses. A t-test applied to the two sets of rule scores 
shows that there is a significant difference, and this may be used to classify the 
ABRs. In order to carry out the classification a threshold is chosen. The normal 
distributions of the rule scores are used to obtain the classification threshold. The 
normal probability density function is 

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)} \qquad (-\infty < x < \infty) \qquad (1)$$



Feature Extraction and Classification of the Auditory Brainstem Response 175 



where $\mu$ and $\sigma$ ($\sigma > 0$) are parameters representing the population mean and 
standard deviation. From Equation (1) the normal probability density function 
$f(x,\mu_1,\sigma_1)$ for the ABRs with responses and the normal probability density 
function $f(x,\mu_2,\sigma_2)$ for the ABRs with no response are obtained: 

$$f(x,\mu_1,\sigma_1) = \frac{1}{\sigma_1\sqrt{2\pi}}\, e^{-(x-\mu_1)^2/(2\sigma_1^2)} \qquad (-\infty < x < \infty) \qquad (2)$$

$$f(x,\mu_2,\sigma_2) = \frac{1}{\sigma_2\sqrt{2\pi}}\, e^{-(x-\mu_2)^2/(2\sigma_2^2)} \qquad (-\infty < x < \infty) \qquad (3)$$

In order to find the cross-over value for the two distributions, the value of x is 
calculated for which 

$$f(x,\mu_1,\sigma_1) = f(x,\mu_2,\sigma_2) \qquad (4)$$

and the solution between $\mu_1$ and $\mu_2$ is chosen as the threshold for the classification. 
Figure 3 plots the normal probability distributions of the rule scores of the 
ABRs with responses and the ABRs with no response, and x is the value at which the 
two probability densities are equal. From the plot, it can be seen that ABRs with no 
response can still obtain non-zero scores; this is due to ongoing background EEG 
activity, which can have characteristics similar to those obtained from the 
response to an actual stimulus. 
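
Equating the two densities and taking logarithms gives a quadratic in x, so the cross-over value can be found in closed form. The following is a minimal sketch under that derivation; the function names are hypothetical, and the parameter values in any call would be the sample means and standard deviations of the two sets of rule scores, which are not listed in this section.

```python
import math


def normal_pdf(x, mu, sigma):
    """Normal probability density function."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))


def crossover_threshold(mu1, sigma1, mu2, sigma2):
    """Return the x between mu1 and mu2 at which the two normal densities are equal.

    Taking logs of pdf1(x) = pdf2(x) gives a*x^2 + b*x + c = 0 with:
    """
    a = 1.0 / (2.0 * sigma2 ** 2) - 1.0 / (2.0 * sigma1 ** 2)
    b = mu1 / sigma1 ** 2 - mu2 / sigma2 ** 2
    c = (mu2 ** 2 / (2.0 * sigma2 ** 2) - mu1 ** 2 / (2.0 * sigma1 ** 2)
         + math.log(sigma2 / sigma1))
    if abs(a) < 1e-12:
        # Equal standard deviations: the equation is linear.
        if b == 0:
            raise ValueError("identical distributions have no unique cross-over")
        roots = [-c / b]
    else:
        disc = b * b - 4.0 * a * c
        if disc < 0:
            raise ValueError("the two densities do not cross")
        root = math.sqrt(disc)
        roots = [(-b - root) / (2.0 * a), (-b + root) / (2.0 * a)]
    low, high = sorted((mu1, mu2))
    for x in roots:
        if low <= x <= high:
            return x
    raise ValueError("no cross-over between the two means")
```

When the two standard deviations are equal, the threshold reduces to the midpoint of the two means.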




Fig. 3. The normal probability distributions of the ABRs with responses and the ABRs with no 
response. At x, the probability density of Normal($\mu_1$, $\sigma_1^2$) equals the 
probability density of Normal($\mu_2$, $\sigma_2^2$) 



The rule scores for the 64 and 128 averaged ABRs are different, so two classification 
thresholds are obtained: x=1.64 is the threshold for the classification of the 
128 averaged ABRs and x=1.49 is the threshold for the 64 averaged ABRs. 



2.6 Classification of the Training Data 

A classification is carried out for the training data after the classification threshold is 
calculated. If the rule score is larger than the classification threshold it is concluded 




176 Rui Zhang et al. 



that there is a response in the ABR and if it is smaller, there is no response. In other 
words, it shows whether the subject can hear or not. Thus the classification accuracy 
of the training data is obtained. A block diagram of this approach is illustrated in 
Figure 4. 



[Figure 4: block diagram with the stages ABR, Wavelet Decomposition, Feature Extraction and Classification] 




Fig. 4. A flowchart of applying the feature extraction and classification approach to the training 
data 



For the 128 averaged ABRs with no response, all of them are correctly classified 
and for the 128 averaged ABRs with responses, 9 of them are correctly classified and 
2 are incorrectly classified. For the 64 averaged ABRs with no response, 18 of them 
are correctly classified and 4 are incorrectly classified. For the 64 averaged ABRs 
with responses, 19 of them are correctly classified and 3 are incorrectly classified. 
The classification results of the 128 averaged and 64 averaged ABR are shown in 
Table 2 and Table 3 respectively. 

The results for the training data are promising, especially for the 128 averaged 
ABRs. The 64 averaged ABRs give lower classification accuracy because these ABRs 
are more obscured by the background EEG noise and the SNR in such situations is 
very poor. 

Table 2. The numbers of the 128 averaged ABRs with no response and with responses 
correctly and incorrectly classified along with the classification accuracy 

                          The ABRs             The ABRs 
                          with no response     with responses 
Correctly Classified      11                   9 
Incorrectly Classified    0                    2 
Classification Accuracy   91% 

Table 3. The numbers of the 64 averaged ABRs with no response and with responses correctly 
and incorrectly classified along with the classification accuracy 

                          The ABRs             The ABRs 
                          with no response     with responses 
Correctly Classified      18                   19 
Incorrectly Classified    4                    3 
Classification Accuracy   84% 





2.7 Validation 

In order to evaluate the classification rules and the classification thresholds 
obtained from the training data, they are applied to three sets of test data from 
patient 1, patient 2 and patient 3. The results of the classification are shown in Table 4. 

Table 4. The classification results of the 128 and 64 averaged ABRs of three sets of test data 

             The 128 averaged ABRs    The 64 averaged ABRs 
Patient 1    86%                      73% 
Patient 2    71%                      70% 
Patient 3    71%                      64% 



In Table 4, the classification accuracy of the data from patient 1 is much better than 
that of the other two sets of data. As mentioned, gender can affect the latency of the 
ABR. The classification rules and the classification thresholds from female training 
data would be more suitable for the classification of female data only, with male 
subjects having separate training data. This is also shown by our classification results. 



3 Incorporating Latency Shifting 

In order to deal with the variances in the latencies of the ABRs, a method for 
adjusting the latency shift between the test data and the training data is considered with 
the aim of improving the classification results. Next, two methods to preprocess the 
ABR, based on Woody averaging and latency correlated averaging, are introduced. 
These methods have previously been seen to improve the results. Here latency shifting 
is combined with wavelet analysis in order to improve classification and reduce 
the number of repetitions needed. 

3.1 Preprocessing the ABR Based on Woody Averaging 

The grand averaged ABR of the training data is again used as the template for Woody 
averaging. The maximum shift of the ABR is restricted to 6, that is, the ABR can shift 
6 data points left and 6 data points right. The 64 or 128 averaged ABR is shifted 
1 data point at a time, and the cross-correlation between the 64 or 128 averaged ABR 
and the template is calculated at each position. Therefore, 13 cross-correlations are 
obtained and each of the 64 and 128 averaged ABRs is shifted to the position where the 
cross-correlation is maximum. 
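
The shift search described above can be sketched as follows. This is an illustrative sketch only: the cross-correlation is taken here as a plain sum of products over the overlapping samples (the normalisation used in the study is not stated), samples shifted in from outside the record are set to zero, and all function names are hypothetical.

```python
def cross_correlation(template, signal, shift):
    """Sum of products between the template and the signal moved right by `shift` points."""
    total = 0.0
    for i, t in enumerate(template):
        j = i - shift
        if 0 <= j < len(signal):
            total += t * signal[j]
    return total


def best_shift(template, signal, max_shift=6):
    """Shift in [-max_shift, max_shift] with maximum cross-correlation (13 candidates for 6)."""
    shifts = range(-max_shift, max_shift + 1)
    return max(shifts, key=lambda s: cross_correlation(template, signal, s))


def apply_shift(signal, shift):
    """Move the whole signal right by `shift` points (left if negative), zero-padding the ends."""
    n = len(signal)
    return [signal[i - shift] if 0 <= i - shift < n else 0.0 for i in range(n)]
```

A positive result from `best_shift` means the ABR is moved to later data points to align with the template, and a negative result means it is moved earlier.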

The result of applying Woody averaging to preprocess the ABR is shown in 
Table 5. The classification results of the 64 averaged ABRs are all improved; 
however, the results for the 128 averaged ABRs from patient 1 and patient 2 are 
inferior and only the result of patient 3 is better than the previous result. 

Table 5. The classification results of three sets of test data by applying Woody averaging to 
preprocess the ABR 

             The 128 averaged ABRs    The 64 averaged ABRs 
Patient 1    68%                      89% 
Patient 2    61%                      71% 
Patient 3    86%                      79% 



3.2 Preprocessing the ABR Based on Latency Correlated Averaging 

Since Woody averaging can only deal with shifting the entire signal in latency, 
latency correlated averaging is performed to calculate different shifts for the 
independent components of the ABR. First, three windows corresponding to data points 
[16-46], [64-88] and [88-130], which cover the time periods where wave I, wave III 
and wave V are expected to occur, are chosen. Then, as in Woody averaging, the 64 or 
128 averaged ABR is allowed to shift 6 data points left and 6 data points right, 
1 data point at each step. For example, in order to get the latency shift of wave I, 
the cross-correlation between the window [16-46], which is the time period for wave 
I, of the 64 or 128 averaged ABR and that of the template is calculated at each position. 
Then the window [16-46] is shifted according to the latency shift where the 
cross-correlation is maximum. In the same way, the latency shifts of wave III and wave V 
are calculated by changing the windows to [64-88] and [88-130], and these two windows 
are shifted according to their latency shifts. The whole preprocessing is complete after 
shifting the three windows to the optimum positions. 
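
A sketch of this window-by-window correction is given below. The window limits are those given in the text; everything else is an assumption made for illustration: data points are treated as 0-based list indices, the cross-correlation is a plain sum of products, samples needed from outside the record are set to zero, and where windows share a data point (88) the later window overwrites the earlier one.

```python
WINDOWS = [(16, 46), (64, 88), (88, 130)]  # waves I, III and V (inclusive limits)


def window_correlation(template, signal, start, end, shift):
    """Correlate the template window with the signal moved right by `shift` points."""
    total = 0.0
    for i in range(start, end + 1):
        j = i - shift
        if 0 <= j < len(signal):
            total += template[i] * signal[j]
    return total


def latency_correct(template, signal, windows=WINDOWS, max_shift=6):
    """Shift each window of `signal` independently to best match the template.

    Returns (corrected signal, list of shifts chosen for each window).
    """
    corrected = list(signal)
    shifts = []
    for start, end in windows:
        candidates = range(-max_shift, max_shift + 1)
        s = max(candidates,
                key=lambda k: window_correlation(template, signal, start, end, k))
        shifts.append(s)
        for i in range(start, end + 1):
            j = i - s
            corrected[i] = signal[j] if 0 <= j < len(signal) else 0.0
    return corrected, shifts
```

Unlike the whole-signal shift of Woody averaging, each wave here receives its own shift, so peaks that move by different amounts can all be aligned.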

Table 6. The classification results of three sets of test data by applying latency correlated 
averaging to preprocess the ABR 

             The 128 averaged ABRs    The 64 averaged ABRs 
Patient 1    91%                      77% 
Patient 2    79%                      75% 
Patient 3    71%                      75% 



Table 6 gives the classification results of the test data after applying latency correlated 
averaging to preprocess the ABR. Comparing Table 6 with Table 4, all the classification 
results improve except for the 128 averaged ABRs from patient 3, which remain 
the same. This preprocessing method adjusts for the variances in the latencies between 
the test data and the template, which indicates that this approach is more suitable for 
the classification of both the female and male data, and it especially improves the 
results for the male data. 



4 Conclusions and Further Work 

The aim of this study is to determine if a response can reliably be detected. If a res- 
ponse is present in the ABR at a certain stimulus intensity, it shows that the subject 
can hear at this level; if not, the subject cannot hear this level of stimulus. In this way, 
the hearing threshold and the hearing loss of the subject can be detected. Because of 
the small amplitudes of the ABR, it is always obscured by the background EEG acti- 
vity. Conventional averaging to pick out the ABR requires up to 2000 repetitions, 
which would be uncomfortable for both the subject and the clinician, so that the re- 
duction in the number of trials required offers a great advantage in the clinical situati- 
on. In this study, 64 and 128 averaged ABRs are used to get promising results. 

Four sets of adult data, from two females and two males, are used in this paper. All 
ABRs are processed by a 5-level DWT, with Bior 5.5 being the mother function. The 
wavelet coefficients are thresholded and matched with the template in order to keep 
those which are caused by the response to the stimulus. After combining the classifi- 
cation rules with these significant wavelet coefficients, the rule score of the ABR is 
obtained and used for the classification. 

This feature extraction method based on wavelet analysis allows investigation and 
classification of the ABR. In order to improve our original classification result, two 
preprocessing procedures, based on Woody averaging and latency correlated averaging, 
are performed to address the variances in the latencies of the 
ABRs. The results of Woody averaging for the 64 averaged ABRs show an improvement, 
but for the 128 averaged ABRs there is in fact a deterioration in results. This is 
possibly because the latency shift of each peak of the ABR is not linear. Woody 
averaging only provides a shift of the complete ABR, so it is not suitable for 
preprocessing the ABR, which has independent peaks. Compared with the original results, 
latency correlated averaging provides better results for both the 64 and 128 averaged 
ABRs except for the result of the 128 averaged ABRs from patient 3, which remains 
the same. 

In general, Woody averaging shows improvement for the 64 averaged ABRs, but 
not for the 128 averaged ABRs. Latency correlated averaging shows improvement in 
all cases. These results require further study with a larger number of subjects. 

In addition, this approach allows both the female data and the male data to be tested by 
adjusting the latency shift in the preprocessing procedure, although the latencies 
of the ABRs differ between genders. By modifying the 
classification rules and the classification thresholds, it could also be applied to ABRs 
of different stimulus intensity and to people in different age ranges. Moreover, other 
preprocessing methods can be used to improve the signal to noise ratio of the ABR, 
and these will be the subject of future research. 

Finally, this approach can also be applied to the other kinds of EP/AEP data, such 
as the visual evoked potentials, the somatosensory evoked potentials and so on. Due 
to the longer recording times involved, any reduction in the number of repetitions 
would be significant. 



Abstract. Diabetes is a metabolic disorder which can be greatly affected by 
lifestyle. The disease cannot be cured but can be controlled, which will minimize 
the complications such as heart disease, stroke and blindness. Clinicians 
routinely collect large amounts of information on diabetic patients as part of 
their day to day management for control of the disease. We investigate the 
potential for data mining in order to spot trends in the data and attempt to predict 
outcome. Feature selection has been used to improve the efficiency of the data 
mining algorithms and identify the contribution of different features to diabetes 
control status prediction. Decision trees can provide classification accuracy 
over 78%. However, while most bad control cases (90%) can be correctly 
classified, at least 50% of good control cases will be misclassified, which means 
that current feature selection and prediction models illustrate some potential but 
need additional refinement. 



1 Introduction 

According to the World Health Organisation, on a global scale, there are around 194 
million people with diabetes in the adult population, 50% of whom are undiagnosed 
[1]. Diabetes describes a condition in which the body cannot make proper use of 
carbohydrate in food because the pancreas does not make enough insulin, or the 
insulin produced is ineffective, or a combination of both. 

It is estimated that 49,000 people in Northern Ireland (NI) have been diagnosed 
with diabetes and another 25,000 have the condition but don't know it. According to 
an investigation by the National Health Service (NHS) in the UK [2]: 

- £1 in every £7 spent by the NHS in NI goes towards the care of diabetes; 

- People with Type 2 diabetes have had the condition for between 9 and 12 years 
before they are diagnosed; 

- At least one third to as many as one half of the people diagnosed with diabetes will 
have diabetes-related complications when initially diagnosed; 

- Approximately three in every 100 people will develop diabetes. 

Insulin is the hormone that helps glucose, derived from the digestion of carbohydrate 
in food, to move into the body's cells where it is used for energy. When insulin 
is not present or is ineffective, glucose builds up in the blood. As people with 
diabetes have problems with their insulin, it is necessary for them to take steps to 
either create insulin or to help their insulin work normally [3]. 

Diabetes can be classified into three main forms: Type 1, Type 2 and gestational 
diabetes. The most common form is Type 2 diabetes, previously called 
non-insulin-dependent diabetes mellitus (NIDDM). The patients may have some endogenous 
pancreatic insulin secretion although this is less than normal. Furthermore their 
tissues may be relatively resistant to the effects of insulin [4]. Type 2 diabetes is a 
common disease that is associated with high mortality and morbidity from macro- 
and micro-vascular disease, which greatly affects patients' quality of life and makes 
it a major public health problem. It has been shown that better blood glucose control 
will reduce the risk of complications significantly [5-8]. 



[Figure 1: plot of relative risk (vertical axis) against HbA1c, % (horizontal axis, 6 to 12)] 

Fig. 1. Relative risk for the development of diabetic complications [9] 

Checking the blood glucose at various times of the day can provide a snapshot 
view of an individual's metabolism. Assuring that the blood glucose is well 
controlled is critical in preventing diabetes-related complications. HbA1c is a 
laboratory test which reveals a patient's average blood glucose over the previous 
12 weeks. This test is recommended by the American Diabetes Association [10] to 
monitor long term glucose control. HbA1c is usually recorded every 3 months, but may 
be performed more often if needed. Specifically, it measures the number of glucose 
molecules attached to haemoglobin, a substance in red blood cells. According to the 
investigation of the UK Prospective Diabetes Study group (UKPDS), every 1% HbA1c 
reduction means 35% less complication risk, both for the micro- and macro-vascular 
complications. Micro-vascular complications increase dramatically when the HbA1c 
measurement is over 10% [11]. Fig. 1 shows the relationship between the risk of 
diabetes-related complication and laboratory HbA1c values. 



2 Data Mining in Diabetes 

Data mining is defined as the process of exploration and analysis of large quantities 
of data in order to discover meaningful patterns and rules. It is the extraction of non- 
trivial, implicit, previously unknown and potentially useful information from large 
amounts of data. It comprises a variety of techniques used to identify nuggets of in- 
formation or decision-making knowledge in data, and extracting these in such a way 
that they can be put to use in areas such as decision support, prediction and estima- 
tion. The data is often voluminous but, as it stands, is of low value as no direct use 
can be made of it; it is the hidden information in the data that is useful. 

A typical data mining procedure comprises pre-processing, applying a data-mining 
algorithm, and post-processing the results. Different aspects of information technol- 
ogy are applied to data mining for better performance, including machine learning, 
statistical and visualisation approaches. 

Due to the greatly increased amount of data gathered in medical databases, 
traditional manual analysis has become inadequate, and methods for efficient 
computer-based analysis are indispensable. To address this problem, knowledge discovery 
in databases (KDD) methods are being developed to identify patterns within the data 
that can be exploited. Data mining methods have been applied to a variety of medical 
domains in order to improve medical decision making, including diagnostic and 
prognostic problems in oncology, liver pathology, neuropsychology, and gynaecology. 
Improved medical diagnosis and prognosis may be achieved through automatic analysis of 
patient data stored in medical records, i.e., by learning from past experiences [12]. 

Data analysis of diabetes has previously been reported [13]. Important goals in the 
management of diabetes include the early detection of people at risk of having 
diabetes and the management of secondary illnesses like diabetes-related blindness, 
stroke and heart attack. It has been shown that the information gathered from medical 
records helped the health agencies, and it was also provided to doctors in the 
form of advice on best practice [14]. 

The study of Duhamel et al. [15] analysed the pre-processing step of data mining 
(mainly focused on data cleaning) and provided tools to handle inconsistent data and 
missing values in a large diabetes database. They applied two methods for dealing with 
incomplete data: imputation using a decision tree and imputation using the mode. The 
former provided better results according to their research. 

Stilou et al. [16] applied the Apriori algorithm to a database containing records of 
diabetic patients and attempted to extract association rules from the stored real 
parameters. The results indicate that the methodology presented may be of good value 
to the diagnostic procedure, especially when large data volumes are involved. The 
implemented system offered an efficient and effective tool in the management of 
diabetes. 



3 Data Preparation 

The diabetic patients' information has been collected by the Ulster Community and 
Hospital Trusts from the year 2000 to the present. All the data was stored in 
Diamond, which is a commercial diabetes clinical information system developed by 
HICOM technology. The database contained clinical information on 2,017 Type 2 diabetic 
patients: 1,124 males and 893 females. The patients' ages ranged from 20 to 96. 
The average age is 66, and only 47 patients are younger than 40. Therefore, 
the work concentrates on the older group. Fig. 2 shows the distribution of patients' 
ages in the research database. 

The 425 features in the databases include patient characteristics, treatment, com- 
plication care, physical and laboratory findings. 




Age 

Fig. 2. The distribution of patients’ age in the database 

Because of the characteristics of medical datasets (incompleteness, incorrectness 
and inconsistency) [12], pre-processing is a very important stage. The pre-processing 
work in this research consists of data integration and reduction. 

Data integration combines data from multiple sources into a coherent database. 
Data mining algorithms are most often based on a single table, within which 
there is a record for each individual and the fields contain variable values specific to 
the individual. In this study, therefore, the first step is to integrate metadata from 
six different databases into one target dataset. 

The process of data reduction represents the selection of parameters that may 
influence blood glucose control. Preliminary inspection of the data set showed that 
37% of the attributes had more than 50% missing data. These attributes were considered 
too sparsely collected and were not included in further analysis. Fortunately, the key 
predictors confirmed by the diabetic expert were not among these 37% of attributes. Of 
the remaining parameters, forty-seven were suggested by the background knowledge 
of diabetic experts and the international diabetes guideline [4]. 

The resulting data set had an average of 10.3% missing values. Among the initial 
47 features, there are 6 attributes with 30-40% missing data, 3 with 20-30%, 3 with 
10-20% and 11 with 1-10%. There are 30,330 records in the dataset; 34.33% of cases 
are in good diabetes control status and 65.67% are in bad status. 



4 Feature Selection 

The performance of most practical classifiers improves when correlated or irrelevant 
features are removed. Therefore, in an attempt to improve the efficiency and accuracy 
of classification algorithms, feature selection techniques are widely used on 
different databases. Feature selection is the process of identifying and removing as much 
of the irrelevant and redundant information as possible. In the database, not all 
available attributes are actually useful. Especially in medical research, hundreds of 
attributes are routinely collected but typically only a small number are used. As this is 
a real world problem, a large number of noisy, irrelevant and redundant features are in 
the data. Clearly irrelevant variables have been removed with the help of the diabetic 
expert. The key factors influencing Type 2 diabetes control are expected to be 
detected by the feature selection. In the experiment, ReliefF [17] has been used for 
feature selection to investigate those important factors in the Type 2 diabetes data set. 



4.1 Reduction of Input Parameters 



Features were ranked using ReliefF, which measures the usefulness of a feature by 
observing the relation between its value and the patient's outcome. Features with a 
negative ReliefF estimate may be considered to be irrelevant. Features with the 
highest scores are presumed to be the most sensitive and to contribute most to the 
outcome prediction [18]. 
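
To illustrate the idea behind this kind of ranking, the sketch below implements the basic Relief scheme, which is a simplification and not the ReliefF algorithm used in the study: it assumes two classes, numeric features without missing values, a single nearest hit and nearest miss per instance, and it visits every instance rather than a random sample. The function name is hypothetical.

```python
def relief_weights(X, y):
    """Simplified Relief feature weights.

    X: list of numeric feature vectors; y: list of class labels.
    A feature gains weight when it differs on the nearest instance of another
    class (miss) and loses weight when it differs on the nearest instance of
    the same class (hit).
    """
    n, d = len(X), len(X[0])
    # Range of each feature, used to scale differences to [0, 1].
    spans = []
    for f in range(d):
        column = [row[f] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(f, a, b):
        return abs(a[f] - b[f]) / spans[f]

    def dist(a, b):
        return sum(diff(f, a, b) for f in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue
        hit = min(hits, key=lambda j: dist(X[i], X[j]))
        miss = min(misses, key=lambda j: dist(X[i], X[j]))
        for f in range(d):
            weights[f] += (diff(f, X[i], X[miss]) - diff(f, X[i], X[hit])) / n
    return weights
```

Features that separate the classes receive positive weights, while features that vary mainly within a class tend towards negative weights, which matches the interpretation of negative estimates given above.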

There are many possible measures for evaluating feature selection algorithms and 
classification models [19]. We use the criterion Classification Accuracy to evaluate 
the performance of ReliefF. In general, a classifier is expected to preserve the same 
accuracy with the reduced set of features as with all the available features or even to 
improve it due to the elimination of noisy and irrelevant features that may mislead the 
learning process. 

However, we expect not only the high classification accuracy, but also the ability 
of a classifier to distinguish the positive and negative samples in the population. 
Therefore, for medical applications, two other measures are more frequently used 
than the classification accuracy: sensitivity and specificity. Sensitivity measures the 
fraction of positive cases that are classified as positive. Specificity measures the frac- 
tion of negative cases classified as negative. 



$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (1)$$

$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (2)$$

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (3)$$

$$\text{Accuracy} = \frac{\text{Sensitivity} \cdot Pos + \text{Specificity} \cdot Neg}{Pos + Neg} \qquad (4)$$

TP: true positives; TN: true negatives; FP: false positives; FN: false negatives; 
Pos: positives; Neg: negatives 

In this study, the bad control cases are regarded as positives, and good control cases 
as negatives. It can be shown that Classification Accuracy can be calculated from 
Sensitivity and Specificity (Equation 4). 
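
The relation between the three measures can be checked numerically. The sketch below uses a hypothetical function name, and any counts passed to it are purely illustrative and are not results from this study; it computes Equations (1) to (3) and also evaluates Equation (4), using Pos = TP + FN and Neg = TN + FP.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy, accuracy via Equation 4)."""
    pos = tp + fn  # all actual positives (bad control cases)
    neg = tn + fp  # all actual negatives (good control cases)
    sensitivity = tp / pos
    specificity = tn / neg
    accuracy = (tp + tn) / (pos + neg)
    # Equation (4): accuracy as a weighted mean of sensitivity and specificity.
    weighted = (sensitivity * pos + specificity * neg) / (pos + neg)
    return sensitivity, specificity, accuracy, weighted
```

Because Sensitivity * Pos equals TP and Specificity * Neg equals TN, the last two returned values always agree, which is why a high accuracy can hide a low specificity when positives dominate the data.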

Because it is difficult to estimate the correct number of predictors in feature 
mining applications [20], different sizes of attribute subsets were selected for each of 
three algorithms (Naive Bayes, IB1 and Decision Tree C4.5 [21]) to find which set 
gave the best performance. 



5 Prediction Model Construction 

Data classification is the process which finds the common properties among a set of 
objects in a database and classifies them into different classes, according to a 
classification model. Classification in data mining aims to predict categorical class 
labels. It constructs a model based on the training set and the values in a classifying 
attribute [22]. 

Algorithms C4.5, IB1 and Naive Bayes were applied to analyse the data in this 
study. Ten-fold cross validation was used to evaluate the effect of each classifier and 
determine the suitable number of attributes for model construction. It divides the data 
into 10 subsets of (approximately) equal size, and then trains the classifier 10 times, 
each time leaving out one of the subsets from training and using it as the testing 
data set [23]. 
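
The partitioning used by ten-fold cross validation can be sketched as follows. This is a generic illustration with a hypothetical function name, not the tool used in the study; records are shuffled with a fixed seed (an assumption) and no stratification by class is applied.

```python
import random


def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (approximately) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each record appears in exactly one test fold, so averaging the accuracy over the 10 runs uses every record once for testing.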

In addition, according to the research of Perner [24] and Dougherty [25], the results of C4.5 can in some cases be significantly improved if features are discretized in advance. To achieve better performance in this experiment, a simple and fast discretization method was applied to the data: an entropy-based strategy combined with the minimal description length principle [26]. The model constructed by C4.5 after discretization is called discretized C4.5 in the next section.
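An entropy-based discretization with a minimal description length stopping rule can be sketched as follows. The acceptance criterion used here is the one commonly attributed to Fayyad and Irani; that it coincides exactly with the method of [26] is an assumption, and all function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return (gain, cut, left_labels, right_labels) for the best binary cut, or None."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy(labels)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # a cut is only possible between distinct values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if best is None or gain > best[0]:
            best = (gain, (pairs[i - 1][0] + pairs[i][0]) / 2, left, right)
    return best


def mdl_accepts(labels, gain, left, right):
    """MDL criterion (assumed form): accept the cut only if the gain pays for its cost."""
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = log2(3 ** k - 2) - (k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right))
    return gain > (log2(n - 1) + delta) / n


def mdl_cut_points(values, labels):
    """Recursively collect accepted cut points in increasing order."""
    found = best_cut(values, labels)
    if found is None:
        return []
    gain, cut, left, right = found
    if not mdl_accepts(labels, gain, left, right):
        return []
    low = [(v, lab) for v, lab in zip(values, labels) if v < cut]
    high = [(v, lab) for v, lab in zip(values, labels) if v >= cut]
    return mdl_cut_points(*zip(*low)) + [cut] + mdl_cut_points(*zip(*high))


# Hypothetical attribute: class changes once, between the values 9 and 10.
print(mdl_cut_points(list(range(20)), [0] * 10 + [1] * 10))
```

A clean class boundary yields a single cut point, whereas an attribute whose labels alternate at random yields none, so such an attribute collapses into one interval.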



6 Results and Discussion 

The attributes used for predicting the patient’s diabetes control were ranked by ReliefF. Each classifier presented above was applied to the data set with different numbers of variables. The top five predictors identified by ReliefF are Age, Diagnosis Duration, Insulin Treatment, Smoking and Family History. The diabetic expert additionally verified and confirmed these as important predictors of diabetes control. It surprised the diabetic expert that BMI (Body Mass Index) was not among the top eight predictors, and that Smoking was more predictive than others such as Complication Type and BMI. This indicates that the feature mining technology detected some novel observations. However, the results need further confirmation before they can influence the clinical domain.

The classification accuracy of all the classifiers is listed in Table 1. Before feature selection, discretized C4.5 performed best; after feature selection, C4.5 obtained the best results. Before feature selection, the results of IB1 and Naive Bayes were lower than those of C4.5 and discretized C4.5. The discretization method did not improve the performance of the decision tree significantly, which indicates that discretization does not work equally well for every database. However, when all 47 variables were used for prediction, discretized C4.5 had the highest classification accuracy among the four classifiers. This suggests that discretization can adjust the bias of the data and make the decision tree more robust, but its effectiveness depends on the characteristics of the data.

Table 1 also presents the influence of the different feature subsets on each classifier. On average, the best prediction result was achieved when the top 15 variables were selected for classification.

Table 1. Classification accuracy (%) for feature subsets of different sizes

Attribute number   Naive Bayes   IB1     C4.5    Discretized C4.5   Average
5                  69.36         69.14   76.36   75.23              72.52
8                  74.60         70.49   76.12   75.76              74.24
10                 72.47         71.54   77.21   77.46              74.67
15                 72.92         70.37   78.73   78.12              75.04
20                 71.48         69.30   76.42   76.73              73.48
25                 69.24         67.88   77.52   77.75              73.10
30                 70.53         67.78   77.43   77.52              73.32
47                 62.35         63.44   75.38   76.37              69.39
Average            70.37         68.74   76.90   76.87





Compared with IB1 and Naive Bayes, feature mining did not significantly affect the results of the decision tree C4.5, either before or after discretization. The main reason is that decision tree schemes have an inherent feature selection mechanism: while constructing a tree, they select the key predictors step by step. This is consistent with the research of Perner [20], which concluded that feature subset selection should improve the accuracy of the decision tree approach (but not significantly). IB1 and Naive Bayes did benefit from the reduction of the input parameters. Because IB1 and Naive Bayes cannot filter out irrelevant or correlated information in the database, the representation and quality of the data affect their performance.

It is interesting that, when all the available attributes are involved in the analysis, worse results are generated, with the exception of discretized C4.5. This experiment shows that not all attributes in the database are actually relevant for diabetic control prediction. The performance of most practical classifiers (C4.5, IB1 and Naive Bayes in this study) improves when correlated and irrelevant features are removed. According to the decision tree constructed by C4.5, the variable “Insulin Treatment” was the best predictor for classifying patients’ disease control, and “Age” was the second best. The attributes “Family History” and “Diagnosis Duration” are also key features for distinguishing the patients with bad blood glucose control. The setting of the minimum instance number determines the size of the final tree. The key variables found by the decision tree C4.5 tallied with the parameters selected by the feature selection approach ReliefF.

For many applications it is important to distinguish accurately between false negative and false positive results. This is particularly important for medical diagnosis, where the balance between sensitivity and specificity plays an important role in evaluating the performance of a classifier [27]. In this study, it is intuitively more important to detect individuals with bad control in the population than to identify good control cases.

Table 2 shows the sensitivity and specificity generated by each classifier. In general, discretized C4.5 performs best at distinguishing patients with bad blood glucose control from the population, and Naive Bayes detects the patients with good blood glucose control best. Naive Bayes has the smallest difference between sensitivity and specificity, which indicates that it can classify both patient groups with reasonable accuracy. IB1, C4.5 and discretized C4.5 tend to perform better when only the bad control cases in the population are to be detected.

Table 2. Sensitivity and specificity (sensitivity/specificity) for feature subsets of different sizes

Attribute number   Naive Bayes   IB1           C4.5          Discretized C4.5
5                  0.912/0.276   0.892/0.306   0.947/0.413   0.938/0.397
8                  0.921/0.411   0.883/0.365   0.951/0.398   0.942/0.405
10                 0.782/0.615   0.907/0.349   0.962/0.409   0.957/0.426
15                 0.631/0.781   0.912/0.306   0.973/0.432   0.987/0.387
20                 0.685/0.772   0.838/0.416   0.940/0.428   0.963/0.393
25                 0.656/0.762   0.821/0.407   0.932/0.475   0.972/0.405
30                 0.708/0.700   0.835/0.377   0.935/0.467   0.955/0.431
47                 0.587/0.693   0.810/0.298   0.928/0.421   0.964/0.381
Average            0.735/0.625   0.862/0.353   0.946/0.430   0.960/0.403



It is well known from clinical studies (UKPDS) that Type 2 diabetes is a progressive condition in which overall blood sugar control deteriorates with time. Older patients and those who have been diagnosed with Type 2 diabetes for longer generally have the worst overall blood sugar control. It is therefore reassuring from the clinical standpoint (and affirms the validity of the data mining techniques used) that “Age” and “Diagnosis Duration” were the features selected as the principal factors in determining whether overall blood sugar control was good or bad.

Improving blood glucose control is a key aim in treating individuals with diabetes. Sustained good blood sugar control reduces the risk of long-term diabetes complications. Improved blood sugar control can be achieved with diet and regular physical activity, oral medications, insulin injections or a combination of these approaches. In Type 2 diabetes, patients generally move over time in a stepwise fashion through dietary treatment, then oral therapies, and eventually need insulin therapy to control their blood sugars. Despite all of these treatments, blood sugar control continues to deteriorate with time, so it is likely that those on “Insulin Treatment” have the worst overall blood sugar control. “Insulin Treatment” was selected as the best predictor for classifying blood sugar control. This again makes clinical sense.

Overall, there was high concordance between the features selected using data mining techniques and the factors anticipated as being important by the diabetes expert. The models’ high predictive performance and the clinical relevance of the selected features suggest that decision support and prediction will be achievable with further refinements.

7 Conclusion and Future Work 

Preliminary results of this research have been presented previously by the authors [13]. Technically, feature selection can improve the performance of classifiers, including accuracy and efficiency; clinically, the key predictors of diabetes control have been confirmed by the diabetic expert.

The performance of a binary classifier is usually quantified by its accuracy during the test phase, i.e. the fraction of correctly classified cases on the test set [27]. In the experiment, we evaluated the performance of the classifiers using not only accuracy but also sensitivity and specificity, to characterize their false positive and false negative behaviour. The results show that Naive Bayes keeps the best balance between sensitivity and specificity, but with lower accuracy. In contrast, the decision trees obtain the highest accuracy and distinguish bad control cases from the population best, but misclassify more than half of the good control cases. This may have various causes, such as the influence of missing values, the characteristics of the selected predictors and the imbalance of the class labels. It will be worth investigating how to achieve the best balance among these three measurements for a practical classifier.

Abstract. Cytochrome P450s are an important class of enzymes that play a significant role in drug metabolism, and thus in the drug discovery process. With a data set compiled from publicly available data on cytochrome P450 drug interactions and from calculated chemoinformatics descriptors, we have built binary classifiers based on kernel methods, in particular support vector machines implementing the nearest point algorithm. Feature selection is used as a preliminary stage of supervised learning. We work on supervised as well as unsupervised selection methods. The classification results from a selected subset of the test set are compared structurally with compounds from the training set.



1 Introduction 

A chemical compound cannot become a valuable therapeutic drug simply due to its activity. In addition, it should be well absorbed and distributed to the target organs to an adequate extent. Further, it has to be metabolized in such a way that the pharmacologic activity is retained, no toxic metabolites are formed, and the drug is not eliminated too fast. In this paper we focus on a small part of these potential problems, addressing the ADMET properties of a drug¹ which are related to a class of enzymes called the cytochrome P450s [1]. Cytochrome P450s are drug metabolizing enzymes which, together with others like monoamine oxidase (MAO) or UDP-glucuronosyl transferase [2], are principally found in the liver. Knowledge or, better, reliable prediction of the interactions of drugs with these enzymes is important to the pharmaceutical industry, since these interactions are the source of toxic metabolites or lead to high and fast degradation within the human body, resulting in an overall low efficacy of a drug.

In the drug development pipeline [3], assessment of high quality experimental 
data under in vivo conditions is time consuming, has low throughput, is resource 

1 Acronym for critical early development parameters of drugs: Absorption, Distribu- 
tion, Metabolism, Excretion and Toxicology. 



J.A. Lopez et al. (Eds.): KELSI 2004, LNAI 3303, pp. 191-205, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004

demanding, and therefore only available at a late stage of the process. For these reasons one would appreciate the ability to predict metabolic properties of new chemical entities (NCEs) with computational methods, in order to screen large compound libraries virtually or to synthesize focused compounds with desirable metabolic behaviors.

We decided to work not only with one special machine learning approach, 
but rather to make use of different techniques - classical statistical methods, as 
well as recently proposed algorithms. We will introduce some of these methods 
in this paper. The paper is organized as follows: in Sect. 2 we describe the data 
sets which are the basis of our tests, and introduce our notation; Sections 3 and 4 
describe the feature selection and supervised learning methods we implemented 
for our application area; Section 5 presents some of the results we achieved, 
shows general trends we detected, and gives conclusions of our work. 

2 Description of the Data Sets 

It has been reported that five isoforms (1A2, 2C19, 2C9, 2D6, 3A4) [4] of the P450 superfamily are involved in the metabolism of 90% of all drugs. The isoform classification is based on the protein sequence, with increasing homology within a family (e.g. Cyp1) and subfamily (e.g. Cyp1A). The cytochrome P450 isoforms show varying distributions in individual species. These differences between species are important, since preclinical results from rats and mice are used to extrapolate to effects in the human body. The reaction catalyzed by these enzymes is the transfer of one oxygen atom from molecular oxygen to the substrate.

To improve the overall classification result we therefore calculated additional properties (features) of the atoms and bonds in the drug (ensemble), because, for example, the extent of bond dissociation (C-H activation) directly measures how easily the insertion of oxygen occurs. Drug interactions were further divided into substrates, inhibitors and inducers according to Flockhart [5]. Substrates are compounds oxidized by the enzyme; inhibitors block the function of the enzyme so that its overall activity is reduced; inducers enhance the activity of the enzyme over time.

Overall, the data set contains 296 diverse compounds, predominantly drugs 
or drug-like molecules, including their activity profiles. From the correspond- 
ing compounds we have calculated physicochemical properties. These molecular 
descriptors are mathematical representations of the chemical structures. With 
the MOE program [6], we calculated overall 306 ensemble descriptors includ- 
ing easily interpretable descriptors like atom counts, others which describe the 
connectivity of the molecular graph like the topological descriptors of Randic 
[7] or Kier and Hall [8], as well as 3D-descriptors that encode the 3D (Van der 
Waals) surface of the molecule annotated with physical properties like charge, or 
hydrogen bond acceptor potential. The 3D-based features were calculated from 
the 3D coordinates of the compounds generated by Corina [9] and the corre- 
sponding atom and bond properties by Petra [10]. The depiction of chemical 
structures was done with the CACTVS program [11].

To estimate the impact and contribution of ensemble, bond and atom features on the classification results, we generated three data sets (DS1, DS2, DS3). DS1 contains only the ensemble features and leads to one feature vector per drug. DS2 contains the ensemble features and the features for every atom in the drug; consequently, as many feature vectors as there are atoms are generated for each drug. DS3 was assembled with the calculated bond features. In addition, we used each of these data sets to define another three sets, DS1b, DS2b and DS3b. Altogether we had 49 Cyp splice variants, of which 43, like 2C9 or 11B1, were not chosen to define a classification task. We extracted their information to find out whether the activity profiles of drugs contain additional information. This leads to 43 extra features for each data set. Table 1 gives a short description of our six data sets. Later we will focus on the problem of imbalanced data sets² and the way we tackled it.



Table 1. Characteristics of data sets

          # classes   # observations   # in class 1   % in class 1   # features
DS1(b)    2           296              40             13.5%          306 (349)
DS2(b)    2           13206            1407           10.7%          352 (395)
DS3(b)    2           27478            1769           6.4%           346 (389)


Notation used in this paper: $m \in \mathbb{N}$ is the number of reference observations and $n \in \mathbb{N}$ the number of properties (features). The vector of a single observation $i \in \{1, \dots, m\}$ is called $T^i$ and is the $i$-th row of the $m \times n$ training matrix (reference data). Its Euclidean norm is denoted by $\|T^i\|$. Additionally, we introduce a binary class vector $y \in \{0, 1\}^m$ and two sets $C_0$ and $C_1$, where the $i$-th entry of $y$ signifies that the $i$-th observation belongs to class 0 ($y_i = 0$, $i \in C_0$) or 1 ($y_i = 1$, $i \in C_1$).

3 Feature Selection 

Feature selection refers to the task of dimensionality reduction and capacity 
control [12]. The problem of selecting properties which are responsible for given 
outputs occurs in various machine learning applications [13, 14]. We used feature 
selection methods with the objective of 

1. improving classification performance, 

2. detecting features that are responsible for the underlying class structure, 

3. reducing the training time of our classification algorithms. 

One can distinguish between supervised [14] and unsupervised [15, 12] feature 
selection. The main idea of supervised methods is to utilize the given output 
vector, whereas unsupervised feature selection works without knowledge of class 
affiliations. We used both methods in view of a comparison of classification 
results. 

2 One of the classes contains few examples to learn from.

3.1 Supervised Feature Selection

Our supervised feature selection algorithm ranks all properties, assigning a high rank to features that are able to specify the underlying class structure. We achieve this by testing the significance of each dimension $k$, $k = 1, \dots, n$, with regard to the given output vector.

For a single property $X^k$ we can always define two samples. The first sample consists of the observations which belong to class 0; observations from class 1 form the second sample. We analyze whether the two one-dimensional distributions differ greatly, since we believe that this indicates a correlation between feature and class mapping. For this purpose, we use the $\chi^2$ test (chi-squared test) for homogeneity [16]. The $\chi^2$ test is the most popular hypothesis testing method for discrete data. It requires a categorical characteristic $X$ with $c$ categories and $g$ groups ($c, g \in \mathbb{N}$) and examines whether the distribution of $X$ is similar in all groups. Note that $g = 2$, as we consider binary problems. Since not all space dimensions are discrete, it is important to assign all values to a category. First we scale the data to variance 1 and start with predefined intervals; then we examine the number of representatives in each category, and merge neighboring intervals if fewer than 5 representatives occur³. This ensures that each category contains an adequate number of members. The final test result shows whether the current feature is suitable to contribute to the underlying classification, or whether we can abandon its use as a descriptor of the system. In the latter case, we delete the $k$-th column of the reference matrix, as is done for a similar approach with Kolmogorov-Smirnov tests [13].
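The interval merging and the homogeneity statistic for one feature can be sketched as follows. This is an illustrative sketch under stated assumptions: the threshold of 5 is applied to the total count of an interval over both classes, merging proceeds from left to right, and both classes are non-empty. The function names and the example counts are hypothetical.

```python
def merge_sparse_bins(counts0, counts1, min_count=5):
    """Merge neighbouring intervals until each holds at least min_count observations."""
    merged = []
    acc0 = acc1 = 0
    for c0, c1 in zip(counts0, counts1):
        acc0 += c0
        acc1 += c1
        if acc0 + acc1 >= min_count:
            merged.append((acc0, acc1))
            acc0 = acc1 = 0
    if acc0 + acc1 > 0:  # a sparse tail joins the last complete interval
        if merged:
            last0, last1 = merged.pop()
            merged.append((last0 + acc0, last1 + acc1))
        else:
            merged.append((acc0, acc1))
    return merged


def chi2_homogeneity(table):
    """Chi-squared statistic and degrees of freedom for a list of (class 0, class 1) counts."""
    row0 = sum(a for a, _ in table)
    row1 = sum(b for _, b in table)
    total = row0 + row1
    stat = 0.0
    for a, b in table:
        col = a + b
        for observed, row in ((a, row0), (b, row1)):
            expected = row * col / total
            stat += (observed - expected) ** 2 / expected
    return stat, len(table) - 1  # (g - 1)(c - 1) with g = 2 groups


# Hypothetical per-interval counts of one feature in class 0 and class 1.
table = merge_sparse_bins([1, 2, 10, 12, 3, 1], [0, 1, 2, 9, 11, 2])
stat, dof = chi2_homogeneity(table)
print(table, stat, dof)
```

The statistic is then compared with the critical value of the chi-squared distribution for the given degrees of freedom (5.991 for 2 degrees of freedom at the 5% level); features whose statistic stays below it are candidates for deletion.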

Investigating single properties can leave strongly correlated features among those with high ranking, e.g. if two columns of the data matrix are equal. For this reason, we propose hierarchical clustering of properties [17]. One popular method is the complete-link algorithm, where the distance between clusters $C_0$ and $C_1$ is defined as the maximum of all pairwise distances between members of $C_0$ and $C_1$ [17]. Our algorithm uses rank correlation between pairs of features to define a cluster tree, and replaces highly correlated subtrees by single features with high rankings.
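A compact sketch of this redundancy filter is given below. It assumes Spearman rank correlation, a distance of 1 - |correlation|, a fixed distance threshold for cutting the cluster tree, and non-constant feature columns; none of these details is stated in the text, and all names are hypothetical.

```python
def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            result[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def spearman(a, b):
    return pearson(ranks(a), ranks(b))


def complete_link_clusters(columns, max_distance):
    """Merge clusters while their complete-link distance stays within max_distance."""
    n = len(columns)
    dist = [[1 - abs(spearman(columns[i], columns[j])) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete link: the largest pairwise distance between members
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > max_distance:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def select_representatives(clusters, scores):
    """Keep, from each cluster, the feature with the highest ranking score."""
    return sorted(max(c, key=lambda i: scores[i]) for c in clusters)


# Hypothetical feature columns and ranking scores.
columns = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [5, 3, 1, 4, 2], [10, 8, 6, 4, 1]]
scores = [0.9, 0.5, 0.7, 0.8]
clusters = complete_link_clusters(columns, max_distance=0.1)
print(clusters, select_representatives(clusters, scores))
```

In the example, features 0, 1 and 3 are monotone transformations of each other and collapse into one cluster, from which only the best-ranked feature is kept.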



3.2 Unsupervised Feature Selection 

In addition to our supervised method, we also developed unsupervised feature selection algorithms to reduce the dimensionality of our data sets. The well-known principal component analysis (PCA) [18] is not suited for our purpose, since principal components are linear combinations of all features and do not lead to the selection and reduction of input data. However, we implemented methods based on PCA results [12] that offer the selection of $p$ features or the elimination of $n - p$ features.

3 This value is generally known as the minimal demand.

Another approach is proposed in [15]. The method of principal variables does not perform any PCA, but attempts to transfer the optimality property of principal components to single features. To be more precise, the first principal component always shows high variance and minimizes the sum of the remaining variances in the data set. Thus the first selected feature is $X^{j_1}$ where

\[ j_1 = \operatorname*{argmax}_{1 \le j \le n} \sum_{k=1}^{n} \operatorname{Var}(X^k)\, \rho^2(X^k, X^j) \tag{1} \]

and $\rho(X^k, X^j)$ denotes the correlation between $X^k$ and $X^j$. Substituting

\[ \rho(X^k, X^j) = \frac{\operatorname{Cov}(X^k, X^j)}{\sqrt{\operatorname{Var}(X^k)}\,\sqrt{\operatorname{Var}(X^j)}} \tag{2} \]

leads to

\[ j_1 = \operatorname*{argmax}_{1 \le j \le n} \frac{1}{\operatorname{Var}(X^j)} \sum_{k=1}^{n} \operatorname{Cov}^2(X^k, X^j). \tag{3} \]



We chose this method for our tests due to its low demand for resources. 
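The selection of the first principal variable needs only variances and covariances, which explains the low resource demand. A minimal sketch, assuming sample covariances with denominator m - 1 and non-constant columns (the function names and the data are hypothetical):

```python
def covariance(a, b):
    """Sample covariance of two equally long columns."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (n - 1)


def first_principal_variable(columns):
    """Index j that maximises sum_k Cov^2(X^k, X^j) / Var(X^j)."""
    best_j, best_score = None, float("-inf")
    for j, xj in enumerate(columns):
        var_j = covariance(xj, xj)
        score = sum(covariance(xk, xj) ** 2 for xk in columns) / var_j
        if score > best_score:
            best_j, best_score = j, score
    return best_j


# Hypothetical, mutually uncorrelated columns: the score reduces to the variance.
columns = [
    [0, 10, 20, 30, 40],
    [1, -1, 1, -1, 1],
    [1, 0, -2, 0, 1],
]
print(first_principal_variable(columns))
```

For uncorrelated features the criterion simply picks the feature with the largest variance; correlated features additionally contribute the variance they explain in each other.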



4 Classification 

Classification problems occur when given data points ought to be assigned to one 
of two or more classes [17, 19]. Data sets tend to increase dramatically, and one 
has to develop fast and reliable classifiers. We decided to work on different kernel 
methods [20]; currently we are examining support vector machines (SVMs) [22] 
and kernel density classification [23] both suitable for supervised learning. 
Kernels can be interpreted as similarity measures in the space of input data. 

Due to restrictions on the size of this paper, we omit a detailed description 
of kernel density classification, a fast and reliable method for classification of 
large data sets, particularly because Sect. 5 will discuss SVM results exclusively. 
Instead we will address some of the details of our SVM implementation. 

4.1 Quality 

To be able to compare classification results, one can use different characteris- 
tics. We give a short introduction to some well-known metrics, as well as the 
enrichment factor, and receiver operating characteristic graphs. 

It is very important to count mis-classified points. For a test set, we de- 
note the number of true positive (true negative, false positive, false negative) 
predictions with TP (TN, FP, FN). The respective rates are 

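\[ tp := \frac{TP}{TP + FN} \ \text{(true positive rate)}, \qquad fp := \frac{FP}{TN + FP} \ \text{(false positive rate)}, \]
\[ tn := \frac{TN}{TN + FP} \ \text{(true negative rate)}, \qquad fn := \frac{FN}{TP + FN} \ \text{(false negative rate)}. \]

Note that $tn + fp = 1$ and $tp + fn = 1$. Frequently used efficiency measures are

\[ \frac{TP}{TP + FN} \ \text{sensitivity (recall)}, \qquad \frac{TP}{TP + FP} \ \text{precision}, \]
\[ \frac{TN}{FP + TN} \ \text{specificity}, \qquad \frac{TP + TN}{TP + FP + TN + FN} \ \text{accuracy}. \]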



Now we are able to introduce the enrichment factor. This measure is very 
useful for imbalanced data sets, where precision and accuracy lead to inexact 
views of classification results. For example, for a test set with 1% positive ob- 
servations, we can achieve 99% accuracy without any learning effect. For class 
1, the enrichment factor ef is defined as the percentage of correctly predicted 
observations divided by the original percentage of observations in this class: 



\[ ef := \frac{\text{fraction of positives after classification}}{\text{fraction of positives before classification}} \tag{4} \]

\[ \phantom{ef :}= \frac{TP / (TP + FP)}{P / (P + N)} \tag{5} \]



Thus the enrichment factor is the ratio of precision and the fraction of positive 
points. Depending on its value, ef can be interpreted as follows: 

ef > 1 improvement (enrichment), 

ef = 1 no learning effect, 

ef < 1 impairment. 
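The three cases can be illustrated with a short computation. This is a minimal sketch; the test set of 1000 points with 1% positives and both classifiers are hypothetical.

```python
def enrichment_factor(tp, fp, positives, negatives):
    """Precision divided by the fraction of positives in the test set."""
    precision = tp / (tp + fp)
    prior = positives / (positives + negatives)
    return precision / prior


# Hypothetical test set: 10 positives among 1000 points.
# A classifier that labels every point as positive learns nothing: ef = 1.
ef_all_positive = enrichment_factor(tp=10, fp=990, positives=10, negatives=990)

# A classifier that finds 8 of the 10 positives with 12 false alarms.
ef_selective = enrichment_factor(tp=8, fp=12, positives=10, negatives=990)

print(ef_all_positive, ef_selective)
```

The selective classifier has a lower accuracy than the trivial "all negative" answer (98.6% against 99%), yet its enrichment factor of about 40 shows that its positive predictions are far richer in true positives than the unfiltered test set.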

Furthermore, the receiver operating characteristic (ROC), originally used in 
signal detection theory, is also helpful for a presentation of classification results 
on imbalanced data sets [21]. ROC is a scatterplot that shows the relationship 
between recall (y-axis) and false positive rate (x-axis). We used our classifiers 
in a way that outputs only class labels; each classifier produces one single point 
in the ROC space, and we do not obtain ROC graphs. Instead, one can use the 
ROC space to plot test results on different data sets, e.g. to compare classification 
results after the application of different feature selection algorithms. One point 
in the ROC space is superior to another if the recall is higher, or the false positive 
rate is lower, or both [21]. 
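The comparison of single points in ROC space can be sketched as follows. The superiority rule is read here as Pareto dominance (no worse in both coordinates and different in at least one), which is an interpretation of the sentence above; the confusion-matrix counts are hypothetical.

```python
def roc_point(tp, fn, fp, tn):
    """Return (false positive rate, recall) for one classifier run."""
    return fp / (fp + tn), tp / (tp + fn)


def dominates(p, q):
    """True if point p is superior to point q in ROC space."""
    return p[1] >= q[1] and p[0] <= q[0] and p != q


# Hypothetical results of three runs on the same test set.
a = roc_point(tp=6, fn=2, fp=2, tn=50)
b = roc_point(tp=4, fn=4, fp=3, tn=49)
c = roc_point(tp=7, fn=1, fp=6, tn=46)

print(dominates(a, b), dominates(a, c), dominates(c, a))
```

Run a is superior to run b, whereas a and c are incomparable: c has the higher recall but also the higher false positive rate.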

This short introduction explains the correlation between the presentation of 
classification results, and the applied measure of quality. 



4.2 Support Vector Machines 

Support vector machines were developed by Vladimir Vapnik [24]. Their linear 
learning approach [25] considers classifiers of the form

\[ f_{\mathrm{lin}}(S) = \sum_{k=1}^{n} w_k S_k + b \qquad (w \in \mathbb{R}^n,\ b \in \mathbb{R}) \tag{6} \]

with yet unknown $w$ and $b$. We assign point $S$ to class 1 if $f_{\mathrm{lin}}(S) \ge 0$, otherwise to class 0. To create a non-linear classification function, we look for a transformation that maps the input data to a space $\mathbb{F}$ with higher dimension $d \in \mathbb{N}$. This leads to

\[ f_{\mathrm{nonlin}}(S) = \sum_{k=1}^{d} w_k \phi_k(S) + b \qquad (w \in \mathbb{F},\ b \in \mathbb{R}) \tag{7} \]

where $\phi(S)$ denotes the $d$-dimensional vector resulting from non-linear transformation of $S$. By application of statistical learning theory [24] and duality theory [20], the resulting dual classification function [26]

\[ f_{\mathrm{dual}}(S) = \sum_{i=1}^{m} y_i \alpha_i \langle \phi(S), \phi(T^i) \rangle + b \tag{8} \]



depends only on dot products in the high-dimensional space $\mathbb{F}$. The vector of Lagrange multipliers $\alpha$ is the global solution of a dual optimality problem. Optimality (Karush-Kuhn-Tucker) conditions result in most of the Lagrange multipliers becoming zero [20]. Training points with positive $\alpha_i$ values are called support vectors. Only a subset of reference points contributes to the classification of unknown points. This is in fundamental contrast to kernel density classification, where every single reference data point contributes to the decision about the class to which an arbitrary test point belongs. Support vector machines avoid the construction of a reasonable non-linear mapping for the input data; the dot products in (8) are replaced by function values $K(S, T^i)$ where $K$ is a kernel. Although we implemented different kernels, the results presented here were achieved with the Gaussian kernel [20]



\[ K(S, T^i) = e^{-\frac{\|S - T^i\|^2}{2\sigma^2}} \tag{9} \]

The kernel width $\sigma > 0$ has to be chosen by the user.
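How a kernel replaces the dot products in the dual classification function can be sketched as follows. This is an illustrative sketch, not the authors' Fortran90 implementation: the normalisation of the exponent by twice the squared kernel width is an assumption, each coefficient stands for the product of a signed class label and its Lagrange multiplier, and the support vectors, coefficients and offset are hypothetical rather than the result of training.

```python
from math import exp


def gaussian_kernel(s, t, sigma):
    """Gaussian kernel; the 2 * sigma**2 normalisation is an assumed convention."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(s, t))
    return exp(-squared_distance / (2 * sigma ** 2))


def dual_decision(s, support_vectors, coeffs, b, sigma):
    """Kernelised dual function: sum_i coeff_i * K(S, T^i) + b."""
    return sum(c * gaussian_kernel(s, t, sigma)
               for c, t in zip(coeffs, support_vectors)) + b


def classify(s, support_vectors, coeffs, b, sigma):
    """Class 1 if the decision value is non-negative, otherwise class 0."""
    return 1 if dual_decision(s, support_vectors, coeffs, b, sigma) >= 0 else 0


# Hypothetical support vectors with signed coefficients (not a trained model).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
coeffs = [1.0, -1.0]
print(classify((0.1, 0.1), support_vectors, coeffs, b=0.0, sigma=1.0))
print(classify((2.0, 1.9), support_vectors, coeffs, b=0.0, sigma=1.0))
```

Only the support vectors enter the sum, so the cost of classifying a new point grows with their number rather than with the size of the whole reference set.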



4.3 Implementation Details 

Our Fortran90 SVM software implements not only the well-known SMO (sequential minimal optimization) algorithm [27] and its improvements [28], but also the so-called NP (nearest point) algorithm [29]. The nearest point algorithm is a fast iterative method which solves the problem arising in training SVMs by computing the nearest point between two convex polytopes. It combines and modifies ideas of different classical nearest point algorithms [30]. A nearest point algorithm can be used to solve SVM problems because the formulation as a nearest point problem also leads to the computation of inner products in $\mathbb{F}$. These can be replaced with kernel evaluations, so that the SVM kernel trick [20] applies. Like SMO, the NP algorithm allows classification violations during the training cycle. The NP algorithm is not as simple as SMO and its implementation is somewhat demanding, but the pseudocode given in [29] can help. Our NP algorithm is called SVM-NPA.

We observed intense usage of support vector machine algorithms implementing decomposition methods [31], like the SVM$^{light}$ package [32] (see e.g. [13, 26]) or the SMO algorithm [27], which defines minimal working sets of two points (see [33]). Since we believe that it is also important to study less well-known techniques, we present results achieved by SVM-NPA. Although SMO is a fast method, it was outperformed by SVM-NPA. For a data set with 50 features, 236 training and 50 test points, NPA required 4 seconds, whereas SMO needed 15 seconds (1.7 GHz, 1 GB RAM, Linux kernel 2.4.21). Both performed data normalization and tenfold cross-validation [14]. Additionally, there is a third motivation to use NPA. We deal with imbalanced data sets, and therefore decided to use the idea of different error weights $C^+$ and $C^-$; see [34] for details. The SVM parameter $C$ (see e.g. [20]) is used directly in different SMO subroutines, but the NPA implementation only uses it in the kernel function⁴, so the realization of different weights was very easy. The final kernel $K^*$ for the SVM-NP algorithm is of the form



K*{T‘,T J ) 




(10) 



where $1 \le i, j \le m$⁵. One of the challenges when using SVMs for real data sets is kernel selection and parameter tuning. We refer to [35] for further information. Our parameter selection is based on leave-one-out cross-validation for small data sets, tenfold cross-validation for larger ones [14, 26], and an additional option for recall to have an independent quality check. We implemented a simple grid search, and use different optimality criteria for parameter tuning, e.g. the F-measure [36] for imbalanced data sets.
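A grid search with a pluggable optimality criterion can be sketched as follows. This is a minimal sketch: the parameter names `C` and `sigma`, the candidate values and the toy scoring function are assumptions; in practice the scoring function would return a cross-validated criterion such as the F-measure.

```python
from itertools import product


def f_measure(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


def grid_search(evaluate, grid):
    """Try every parameter combination and keep the best-scoring one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy stand-in for a cross-validated score, with its maximum at C = 10, sigma = 0.5.
def toy_score(C, sigma):
    return -abs(C - 10) - abs(sigma - 0.5)


best_params, best_score = grid_search(
    toy_score, {"C": [1, 10, 100], "sigma": [0.1, 0.5, 1.0]}
)
print(best_params, best_score, f_measure(tp=5, fp=1, fn=3))
```

Because the F-measure ignores true negatives, it is not inflated by the large majority class, which makes it a more informative tuning criterion than accuracy on imbalanced data.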



5 Results and Conclusions 

The classifications generated a vast amount of output, which had to be analyzed by careful perusal. We can therefore only summarize the trends that we have found. In addition, we focus here on Cyp1A2 inhibitors when comparing the classification on a molecular basis. For all tests, we used 80% of all points for training and the other 20% for classification (test). The test sets had never been seen by the algorithm before, and were run only once for the assessment of generalization performance.

Support vector machine classification results including enrichment factors 
are summarized in Table 2. Enrichment factors range between 4 and 8, overall 
performance between approximately 88 and 96 percent. 

4 This is a consequence of the L2 soft-margin formulation; for details, see the reference. 

5 Note that $C^+$ and $C^-$ have no influence during the classification stage; we only make use of them during the training cycle to handle mis-classified reference points.

Table 2. Performance of SVM-NPA for the 1A2 system

        supervised feature selection          unsupervised feature selection
        10 features       50 features         10 features       50 features
        ef   accuracy     ef   accuracy       ef   accuracy     ef   accuracy
DS1     8    88.3%        5    91.7%          6    93.3%        4    88.3%
DS1b    8    88.3%        5    91.7%          6    93.3%        6    93.3%
DS2     5    91.3%        7    95.1%          5    90.4%        7    93.5%
DS2b    3    84.4%        7    92.1%          5    90.4%        5    89.1%
DS3     6    91.1%        6    93.5%          6    92.3%        8    95.3%
DS3b    5    91.1%        6    92.0%          5    91.7%        7    93.2%



The general trends for data sets DS1b, DS2b and DS3b are higher false positive rates, and in none of the classifications (except for DS1b with 3 extra true negative points) could we increase the overall predictivity of the classifier by including the extensive Cyp profiles of the 43 other variants. The feature selection algorithms did not recognize this additional information as relevant in distinguishing between the different molecular moieties. This can be explained by the low sequence homology shared between the different isoforms, so that the active sites of the enzymes recognize different parts of the molecule (substrate recognition). For these reasons, we discuss the classification results obtained from the physicochemical properties alone, without the additional information from the enzyme profiles.

For the 1A2 system, the best results, with less than 5% overall error on the
test set, were obtained for DS2 and DS3 using SVM-NPA with 50 features. The
3A4 inducers and 2D6 substrates followed with moderate predictivity. The 3A4
inhibitors and substrates were the hardest to predict, although error rates
were still in acceptable ranges.

In the DS1 series, the unsupervised selection of 10 features led to the best
classification results: 5 of 8 positive and 51 of 52 negative examples were
predicted correctly. The SVM-NP algorithm gave overall error rates between
6.7% and 11.7% on the test set. Because the test set is small (60 points),
these differences are not significant: every misclassified point adds nearly
2% to the error rate.
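
The arithmetic behind this remark can be checked directly. The following is a
minimal sketch that uses only the counts quoted above (8 positive and 52
negative test examples, of which 5 and 51 were predicted correctly); the
function name is illustrative and not part of the original work.

```python
def error_rate(n_errors: int, n_total: int) -> float:
    """Overall error rate in percent."""
    return 100.0 * n_errors / n_total


n_total = 8 + 52      # size of the DS1 test set
n_correct = 5 + 51    # correctly predicted positives + negatives

best_error = error_rate(n_total - n_correct, n_total)  # 4 errors out of 60
per_point = error_rate(1, n_total)                     # one extra error

print(f"best overall error: {best_error:.1f}%")
print(f"error added per misclassified point: {per_point:.2f}%")
```

With only 60 test points, the gap between the best and the worst reported
error rate corresponds to just three misclassified compounds.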

In the DS2 series, we used 2700 test points; 288 of them belonged to class
1. By selecting 50 features with our supervised method, we achieved correct
classification of 79% in class 1 and 97% in class 0. This is an overall
accuracy of more than 95%; even our worst error rate of 9.6% is an acceptable
result. The inclusion of the atom features mostly improved the classification
quality in comparison to using only pure ensemble features. This can be
explained by the additional information that was included with the selected
atom features. These describe some parts of the substructure better than
features that are derived for the whole molecule. For instance, differences in
C-H labilities lead to different oxidation patterns, and distinguish between
similar structures. We found that 50 features led to a better overall
performance and to a better prediction of negative examples than 10 features,
which indicated fewer false negatives. This trend held not only for the 1A2
system, but also for the other systems we analyzed. We conclude that the use
of different classifiers and feature selection methods allows us to tune the
classification output towards either fewer false positives or fewer false
negatives. This trend can be strengthened by varying model parameters, e.g.
changing the value of $\sigma$ in the Gaussian SVM kernel, or the level of
acceptance in kernel density classification. Usually, a low false negative
rate is desired for the screening of large compound collections; the false
positives will be detected by experimental verification.
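
The overall accuracy quoted for DS2 follows from the two class-wise rates and
the class sizes. The sketch below reproduces that calculation from the figures
given in the text (2700 test points, 288 of them in class 1, 79% and 97%
class-wise accuracy); the helper name is an assumption made for illustration.

```python
def overall_accuracy(n_pos: int, n_neg: int,
                     acc_pos: float, acc_neg: float) -> float:
    """Accuracy over both classes, weighted by class size."""
    return (n_pos * acc_pos + n_neg * acc_neg) / (n_pos + n_neg)


n_pos = 288            # class 1 (positive) test points
n_neg = 2700 - n_pos   # class 0 (negative) test points

acc = overall_accuracy(n_pos, n_neg, acc_pos=0.79, acc_neg=0.97)
print(f"overall accuracy: {100 * acc:.1f}%")
print(f"false negative rate: {100 * (1 - 0.79):.0f}%")
```

Because class 0 is roughly eight times larger than class 1, the overall
accuracy is dominated by the negative class; this is why the class-wise rates
are reported alongside it.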

The DS3 series includes 5500 examples for classification, of which 590 are
positive. The overall error rates range between 4.7% and 8.9%. Compared with
DS2, DS3 gave better overall results and fewer false positive predictions, but
at the same time it increased the number of false negatives (see footnote 6).
This trade-off enables us to construct sensible classifiers.

We can visualize the error rates as points in the ROC space (see Fig. 1).
With our 6 data sets (2x ensemble, 2x atoms, 2x bonds), each classified with
4 different feature sets, we have 24 points. One can conclude that our
classifiers are conservative, since they have few false positive errors and
low true positive rates. This is convenient, because our data sets are
imbalanced, which makes the left-hand side of ROC space very interesting [21].
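
As an illustration of how a single classification result maps to a point in
ROC space, the sketch below converts confusion counts into a (false positive
rate, true positive rate) pair. The counts are those reported above for the
best DS1 result (5 of 8 positives and 51 of 52 negatives correct); the type
and function names are hypothetical.

```python
from typing import NamedTuple


class RocPoint(NamedTuple):
    fpr: float  # false positive rate, x axis of ROC space
    tpr: float  # true positive rate, y axis of ROC space


def roc_point(tp: int, fn: int, fp: int, tn: int) -> RocPoint:
    """Map confusion-matrix counts to one point in ROC space."""
    return RocPoint(fpr=fp / (fp + tn), tpr=tp / (tp + fn))


# Best DS1 result: 5 of 8 positives and 51 of 52 negatives predicted correctly.
p = roc_point(tp=5, fn=3, fp=1, tn=51)
print(f"FPR = {p.fpr:.3f}, TPR = {p.tpr:.3f}")
```

A point with a very small false positive rate and a moderate true positive
rate lies near the left edge of ROC space, which is the sense in which the
classifiers are called conservative.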




Fig. 1. Points in the ROC space produced by SVM-NPA for the 1A2 system
(plot not reproduced; recoverable axis label: false positive rate)



The underlying features are computed properties of molecules. This gives
us the ability to evaluate how a specified classifier can help. Figure 2 shows
some examples of the compounds that were used in the training set. Some of
the structural moieties are later recognized in the classified compounds. We
selected a representative subset of compounds from the test data that allowed us

6 The best overall error rate of 4.7% is divided into fp = 1.4% and fn = 32.2%.
Compared with fp = 8.3% and fn = 20.8% for the best result in DS2, the
trade-off is evident.
Fig. 2. Examples of CYP1A2 inhibitors in the training set




Fig. 3. Selected compounds and their classification results 



some interpretation on the molecular level (see Fig. 3; classification results
are shown for 50 features and supervised selection). In comparison with the
training set (Fig. 2), we found some drugs that share a similar shape, such as
ciprofloxacin and tangeretin or kaempferol. Interestingly, ciprofloxacin was
always classified correctly, independent of the data set or algorithm used.
The same is true for
Table 3. Molecular descriptors chosen by feature selection algorithms

Descriptor          Description

Atom counts
M_COUNT_AROMAT      Number of aromatic atoms
M_COUNT_NH          Number of NH groups
M_COUNT_OH          Number of OH groups
a_acc               Number of hydrogen bond acceptor atoms (not counting
                    acidic atoms but counting atoms that are both hydrogen
                    bond donors and acceptors such as -OH)
a_hyd               Number of hydrophobic atoms
a_nC                Number of carbon atoms
a_nF, a_nCl, a_nBr  Number of fluorine, chlorine, bromine atoms

Bond terms
B_BDE               Bond dissociation energy
B_DENSIG            Difference in sigma-electronegativity
B_PDELOC            Delocalization stabilization of positive charge
B_POLARIZABILITY    Mean of effective atom polarizabilities
B_SQIT              Sum of sigma-charges shifted over all iterations

Topological / Connectivity / Shape terms
chi0                Atomic connectivity index (order 0) from [8],
                    calculated as the sum of $1/\sqrt{d_i}$ over all heavy
                    atoms $i$ with $d_i > 0$
kC_aasC             Kier Atom Type Count (cQ3X3) {17}
kC_dsCH             Kier Atom Type Count (CHX3) {11}
kS_aasC             Kier Atom Type E-state Sum (cQ3X3) {17}
kS_dsCH             Kier Atom Type E-state Sum (CHX3) {11}

2D/3D terms
CASA+               Positive charge weighted surface area
PEOE_PC+ [37]       Total positive partial charge: the sum of the positive $q_i$
PEOE_PC-            Total negative partial charge: the sum of the negative $q_i$
PEOE_RPC-           Relative negative partial charge: the smallest negative
                    $q_i$ divided by the sum of the negative $q_i$
PEOE_VSA+4          Sum of $v_i$ where $q_i$ is in the range [0.20, 0.25]
PEOE_VSA_FHYD       Fractional hydrophobic Van der Waals surface area
PEOE_VSA_FNEG       Fractional negative Van der Waals surface area
PEOE_VSA_FPOL       Fractional polar Van der Waals surface area
vsa_base            Approximation to the sum of VDW surface areas of basic atoms
vsa_pol             Approximation to the sum of VDW surface areas of polar atoms
                    (atoms that are both hydrogen bond donors and acceptors),
                    such as -OH
ASA_P               Water accessible surface area of all polar
                    ($|q_i| \geq 0.2$) atoms

Energy terms
E_ele               Electrostatic component of the potential energy
E_oop               Out-of-plane potential energy
E_sol               Solvation energy
E_stb               Bond stretch-bend cross-term potential energy
E_tor               Torsion potential energy

apigenin and bergapten, whose structures are very close to kaempferol. A
number of false negatives were observed with caffeine, cimetidine and
amiodarone. The classification could be improved within DS2 and DS3, whereas
a correct prediction for amiodarone always failed. The alignment of nitrogen
atoms in cimetidine can be found in pipemidate, but only if atom or bond
features are included. Buspirone and alprenolol were correctly recognized as
non-active, although similar fragments occur in mexiletine or pipemidate. The
only false positive in this series was coumarin in the case of DS1. This
illustrates the expectation that only the amount of information contained in
the training set will be reflected in the test set.

The molecular descriptors in Table 3 summarize the features that were most
often selected by our supervised and unsupervised feature selection algorithms.
In addition to the features chosen by frequency, other features such as
functional groups, the Van der Waals surface, or features that can be
understood on the molecular level (e.g. energy, bond stretching terms) are
listed. From every cluster of molecularly encoded features (surface properties,
atom distances, delocalization energy), a representative subset was selected.

In summary, we showed that the implementation of the SVM-NP algorithm, 
together with feature selection strategies, can be used successfully to build clas- 
sifiers for real data sets. Only a small insight into our investigations could be 
presented in such a short publication. We intend to improve our support vector 
machine algorithms, especially in terms of training time and data cleaning. 
Abstract. Predictive toxicology is the task of building models capable 
of determining, with a certain degree of accuracy, the toxicity of chem- 
ical compounds. Machine Learning (ML) in general, and lazy learning 
techniques in particular, have been applied to the task of predictive tox- 
icology. ML approaches differ in which kind of chemistry knowledge they 
use but all rely on some specific representation of chemical compounds. 
In this paper we deal with one specific issue of molecule representation, 
the multiplicity of descriptions that can be ascribed to a particular com- 
pound. We present a new approach to lazy learning, based on the notion 
of multiple-instance, which is capable of seamlessly working with mul- 
tiple descriptions. Experimental analysis of this approach is presented 
using the Predictive Toxicology Challenge data set. 



1 Introduction 

Thousands of new chemicals are registered every year around the world.
Although these new chemicals are widely analyzed before their
commercialization, the long-term effects of many of them on human health are
unknown. The National Toxicology Program (NTP) was started with the goal of
establishing standardized bioassays for identifying carcinogenic substances
(see more information at http://ntp-server.niehs.nih.gov). These bioassays are
highly expensive in time and money, since they take several years and their
results are sometimes not conclusive. Automatic tools could help to reduce
these costs. In particular, artificial intelligence techniques such as
knowledge discovery and machine learning seem to be especially useful.

The goal of Predictive Toxicology is to build models that can be used to 
determine the toxicity of chemical compounds. These models have to contain 
rules able to predict the toxicity of a compound according to both the structure 
and the physical-chemical properties. A Predictive Toxicology Challenge (PTC) 
[15] was held in 2001 focusing on machine learning techniques for predicting the 
toxicity of compounds. The toxicology data set provided by the NTP contains 
descriptions of the bioassays done on around 500 chemical compounds and their 
results on rodents (rats and mice) of both sexes. 



J.A. Lopez et al. (Eds.): KELSI 2004, LNAI 3303, pp. 206-220, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004
There are two open problems in predictive toxicology: 1) representing the
chemical compounds, and 2) determining which characteristics of chemical
compounds could be useful for classifying them as toxic or not toxic (i.e. the
toxicity model). A summary of both the different representations and the
methods used to build the toxicity models proposed in the PTC can be found in
[4]. Basically, there are two families of representations: those based on
structure-activity relationships (SAR) and those based on the compound
substructures. SAR are equation sets that relate molecular features and that
allow the prediction of some molecular properties before experimentation in
the laboratory. Approaches based on compound substructures (relational
representation) represent a chemical compound as a set of predicates relating
the atoms composing the molecule. Most authors, independently of the kind of
compound representation, use inductive learning methods to build a toxicity
model.

In [3] we introduced a new relational representation based on the chemical 
nomenclature and also a lazy learning technique to assess the toxicity of com- 
pounds. The main difference between our approach and those of the PTC is 
that we do not try to build a toxicity model, but we assess specifically the tox- 
icity of each new chemical compound. This is because lazy learning techniques 
are problem-centered, i.e. they solve a new problem based on its similarity to 
other problems previously solved. In the toxicology domain, lazy learning tech- 
niques assess the toxicity of a chemical compound based on its similarity to other 
chemical compounds with known toxicity. 

In particular, in [3] we proposed to use the k-NN algorithm [10] for assessing
the toxicity of a chemical compound. Because chemical compounds are
represented using feature terms [2] (i.e. they are structured objects), we
defined a new similarity measure called Shaud to be used in the k-NN
algorithm. Results obtained with the lazy learning approach using the feature
term representation of the compounds are comparable to the results obtained
using inductive approaches. Moreover, in our representation only the molecular
structure is taken into account, whereas SAR approaches use a lot of
information related to properties of the molecules and also results of some
short-term assays.

Since our representation of molecules is based on chemical nomenclature, and
this has some ambiguity issues, we propose to use the notion of
multiple-instance [11] in lazy learning techniques. Specifically, the
ambiguities in chemical nomenclature stem from the fact that a single molecule
can often be described in several ways, i.e. it may have synonymous names. The
notion of multiple-instance captures precisely the idea that an example for a
ML technique can have multiple descriptions that, nonetheless, refer to the
same physical object. Therefore, this paper proposes two new techniques for
integrating multiple-instances into k-NN methods and evaluates them
experimentally in the toxicology domain.

This paper is organized as follows: Section 2 describes the issues involved in
representing chemical compounds; Section 3 presents Shaud, a similarity
measure for structured cases, and the new multiple-instance techniques for
k-NN; an empirical evaluation is reported in Section 4; and finally a
conclusions section closes the paper.
Fig. 1. Partial view of the chemical ontology



2 Representation of the Chemical Compounds 

We propose using a representation of chemical compounds based on the chemical 
ontology used by experts in chemistry. We represent compounds as a structure 
with substructures using the chemical ontology that is implicit in the nomencla- 
ture of the compounds. Fig. 1 shows part of the chemical ontology we have used 
to represent the compounds in the Toxicology data set. This ontology is based 
on the IUPAC chemical nomenclature which, in turn, is a systematic way of de- 
scribing molecules. In fact, the name of a molecule provides all the information 
needed to graphically represent the structure of the molecule. 

According to the chemical nomenclature rules, the name of a compound is 
formed in the following manner: radicals’ names + main group. The main group 
is often the part of the molecule that is either the largest or the part located in 
a central position. However, there is no general rule for forming the compound 
name. Radicals are groups that are usually smaller than the main group. A main 
group can contain several radicals and a radical can, in turn, have a new set of 
radicals. Both main group and radicals are the same kind of molecules, i.e. the 
benzene may be the main group in one compound and a radical in some others. 

In our representation (see Fig. 2) a chemical compound is represented by
a feature term of sort compound described by two features: main-group and
p-radicals. The values of the feature main-group belong to some of the sorts
shown in Fig. 1. The value of the feature p-radicals is a set whose elements
are of sort position-radical. The sort position-radical is described using two
features: radicals and position. The value of radicals is of sort compound,
like the whole chemical compound, since it has the same kind of structure (a
main group with radicals). The feature position indicates where the radical is
bound to the main group.
Fig. 2. Representation of TR-339, 2-amino-4-nitrophenol, with feature terms



For example, the chemical compound TR-339, 2-amino-4-nitrophenol (Fig. 2),
has a benzene (see footnote 1) as main group and a set of three radicals: an
alcohol in position one; an amine in position two; and a nitro-derivate in
position four. Note that this information has been directly extracted from the
chemical name of the compound following the nomenclature rules. Moreover, this
kind of representation is very close to the representation that an expert has
of a molecule from the chemical name.
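
The recursive structure of this representation can be sketched with two small
record types. This is only an illustration under simplifying assumptions: the
original work uses feature terms whose sorts come from the chemical ontology,
whereas here sorts are plain strings and the class names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PositionRadical:
    """Sort position-radical: a radical and where it is bound."""
    radicals: "Compound"   # the radical is itself of sort compound
    position: int          # position on the main group


@dataclass(frozen=True)
class Compound:
    """Sort compound: a main group plus a set of positioned radicals."""
    main_group: str        # assumption: ontology sort given as a plain string
    p_radicals: Tuple[PositionRadical, ...] = ()


# TR-339, 2-amino-4-nitrophenol, as described in the text.
tr_339 = Compound(
    main_group="benzene",
    p_radicals=(
        PositionRadical(Compound("alcohol"), 1),
        PositionRadical(Compound("amine"), 2),
        PositionRadical(Compound("nitro-derivate"), 4),
    ),
)

for pr in tr_339.p_radicals:
    print(pr.position, pr.radicals.main_group)
```

Because a radical is again a compound, arbitrarily nested substituents can be
expressed without any additional types.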

Nevertheless, the chemical nomenclature is ambiguous. For instance, from
the name 2-amino-4-nitrophenol, chemists assume that the main group of the
molecule is the benzene and that the radicals are in positions 1, 2 and 4. In
this molecule the name is clear, because the benzene is the largest group and
chemists fully agree on the choice of main group. The names of some other
molecules, however, are more ambiguous. For instance, the chemical compound
TR-154 of the toxicology database is azobenzene (Fig. 3), a compound with a
benzene as main group. This compound is also known as diphenyldiimide, where
the main group is an azo-derivate (structurally equivalent to a diimide).
Therefore, we say that azobenzene and diphenyldiimide are synonyms.

Due to these ambiguities, we propose to take into account synonyms regarding
the structure of the molecule. Thus, 2-amino-4-nitrophenol has several
possible synonyms taking into account different positions of the radicals
(although they are not strictly correct from the point of view of the chemical
nomenclature): we could consider that the amine is in position 1, the alcohol
in position 2 and the nitro-derivate in position 5. Notice that the difference
between the synonymous representations is the position of the radicals.
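
One way to see where such positional synonyms come from is to renumber the
ring atoms starting from a different atom or in the opposite direction. The
sketch below is an assumption about how alternative numberings could be
enumerated for a six-membered ring; the text does not state how the synonyms
were actually produced, and the function name is hypothetical.

```python
from typing import Dict


def renumber(substituents: Dict[int, str], start: int, direction: int,
             ring_size: int = 6) -> Dict[int, str]:
    """Renumber ring positions so that `start` becomes position 1.

    `direction` is +1 to keep the numbering direction and -1 to reverse it.
    """
    return {((pos - start) * direction) % ring_size + 1: name
            for pos, name in substituents.items()}


# 2-amino-4-nitrophenol with the usual numbering.
tr_339 = {1: "alcohol", 2: "amine", 4: "nitro-derivate"}

# Start counting at the amine and walk towards the alcohol.
synonym = renumber(tr_339, start=2, direction=-1)
print(sorted(synonym.items()))
```

Starting at the amine and reversing the direction places the amine in
position 1, the alcohol in position 2 and the nitro-derivate in position 5,
which is the synonym mentioned above.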

Dietterich et al. [11] introduced the notion of multiple-instance. This notion 
appears when a domain object can be represented in several alternative ways. 

1 The phenol is a benzene with an alcohol as radical in position one. 
Fig. 3. Graphical representation of the molecular structure of azobenzene and
two synonymous descriptions (AZB-1 and AZB-2) of azobenzene



This situation is very common in domains such as chemistry, where a molecule
can be seen from several points of view. In particular, when addressing the
problem of determining whether a molecule is active, multiple instances are
needed because a molecule can have several conformations, some of which can be
active and some others not. We propose to use the notion of multiple-instance
to represent the compounds of the toxicology data set. We represented 360
compounds of the PTC data set using feature terms. When a compound can have
several synonymous representations, we defined a feature term for each
alternative representation, i.e. there are multiple instances for the
compound. Fig. 3 shows the synonymous representations of azobenzene using
feature terms: one of them considers the benzene as the main group and the
other considers the azo-derivate as the main group.

Thus, for each one of the 360 chemical compounds of the data set we defined
as many instances as necessary to capture the different synonyms of a
compound according to its structure. For some compounds, the synonyms differ
only in the positions of the radicals, since in all of them we considered the
same main group. Other compounds, instead, have synonyms with different main
groups. This is the case of azobenzene in Fig. 3, where AZB-1 has an
azo-derivate as main group and AZB-2 has a benzene as main group. As will be
explained later, although a compound to be classified is compared with all
the synonymous descriptions of each compound, the final classification takes
into account only the similarity with one of the synonyms. In other words, for
classification purposes the data set contains 360 chemical compounds, even
though most of them have several synonymous representations.

In the next section we explain how the k-NN algorithm can be modified in order
to deal with the synonymous representations of the compounds.
Fig. 4. $\psi_1$ and $\psi_2$ are feature terms represented as graphs.
$\psi_{1 \cup 2}$ is a feature term containing both the shared structure
(shaded nodes) and the unshared structure (white nodes) of $\psi_1$ and
$\psi_2$



3 Similarity of Relational Cases 



In order to assess the toxicity of a chemical compound we proposed the use of
lazy learning techniques. In particular, we use the k nearest neighbor (k-NN)
[10] algorithm. Given a new problem p and a case-base B containing solved
problems, k-NN retrieves from B the k cases that are most similar to p. There
are several similarity assessments that can be used in the k-NN algorithm
[21], but all of them work on objects represented as a set of feature-value
pairs. We, however, represent the chemical compounds as feature terms, i.e.
they have a structured representation, and we proposed Shaud [3] as a
similarity measure for relational cases represented as feature terms. The main
idea of Shaud is to assess the similarity between two feature terms taking
into account their structure. When comparing the structure of two feature
terms $\psi_1$ and $\psi_2$ (see Fig. 4), there are two parts that have to be
taken into account: 1) the part of the structure that is common to both
$\psi_1$ and $\psi_2$, called the shared structure (shown by shaded nodes in
Fig. 4); and 2) the part of the structure that is present in $\psi_1$ but not
in $\psi_2$ and vice versa, called the unshared structure (shown by white
nodes in Fig. 4). Shaud assesses the similarity of two feature terms $\psi_1$
and $\psi_2$ by computing the similarity of the shared structure and then
normalizing this similarity value taking into account both the shared and the
unshared structure.

Let us suppose that the k cases most similar to the new problem p belong
to several classes. In such a situation, a common criterion for assigning a
solution class to p is the majority criterion, i.e. p is classified as
belonging to the solution class that most of the k retrieved cases belong to.
We experimented with Shaud using the majority criterion, but the results were
not satisfactory, since the accuracy in classifying non-toxic compounds was
clearly higher than the accuracy in classifying toxic ones. For this reason,
we proposed a new classification criterion for k-NN called Class Similarity
Average (CSA). CSA is not domain-dependent, and in [3] we showed that it
improves the accuracy on both toxic and non-toxic compounds.
For each compound p to be classified as toxic or non-toxic, Shaud yields the
similarity between p and each one of the k most similar cases. CSA then
computes, for each class, the average similarity of the retrieved cases
belonging to that class; the class with the higher average similarity is
selected as the solution for p. More formally, let the positive class be the
set of chemical compounds that are toxic (or carcinogenic) and the negative
class the set of chemical compounds that are non-toxic. Let $A^+$ be the
positive retrieval set, i.e. the set containing the retrieved cases belonging
to the positive class, and $A^-$ be the negative retrieval set, i.e. the set
containing the retrieved cases belonging to the negative class. The
carcinogenic activity of a compound p is obtained according to the CSA
criterion, where the average similarity for both retrieval sets is computed as
follows:

$$sim^+ = \frac{1}{|A^+|} \sum_{c_i \in A^+} s_i \quad \text{and} \quad sim^- = \frac{1}{|A^-|} \sum_{c_i \in A^-} s_i$$

and then the compound p is assigned to one of the classes according to the
decision rule:



if $sim^+ > sim^-$ then p belongs to the positive class
else p belongs to the negative class
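
The CSA criterion can be sketched in a few lines. The sketch follows the
verbal description above (the class with the higher average similarity is
selected); the function name, the treatment of an empty retrieval set as
average 0, and the similarity values in the example are assumptions made for
illustration only.

```python
from typing import List, Tuple


def csa_is_positive(retrieved: List[Tuple[float, bool]]) -> bool:
    """Class Similarity Average over the k retrieved cases.

    Each retrieved case is (similarity to p, is_positive).
    Returns True if p is assigned to the positive class.
    """
    pos = [s for s, is_pos in retrieved if is_pos]
    neg = [s for s, is_pos in retrieved if not is_pos]
    # Assumption: an empty retrieval set contributes an average of 0.
    sim_pos = sum(pos) / len(pos) if pos else 0.0
    sim_neg = sum(neg) / len(neg) if neg else 0.0
    return sim_pos > sim_neg


# k = 3 with made-up similarities: two positive cases, one negative case.
retrieved = [(0.9, True), (0.8, False), (0.5, True)]
print(csa_is_positive(retrieved))
```

In this made-up example the majority criterion would answer positive (two of
three cases), whereas CSA answers negative because the positive cases are on
average less similar to p (0.7) than the negative case (0.8).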



3.1 Lazy Learning Techniques with Multiple-Instances 

The CSA criterion assumes that the k most similar cases are different
chemical compounds. Nevertheless, this assumption does not hold when using
multiple-instances, since some of the retrieved cases can be, in fact,
different representations of the same compound. For instance, Fig. 5.a
represents a situation where P is the new problem to classify and k = 3. Cases
$c_1$, $c_2$ and $c_3$ are the three cases most similar to P, with
similarities $s_1$, $s_2$ and $s_3$ respectively; $c_1$ is the most similar to
P and $c_3$ is the least similar. Let us assume that $c_1$ and $c_3$ belong to
the positive class and $c_2$ belongs to the negative class. The classification
of P can be done using the CSA criterion and the decision rule as explained
above. Fig. 5.b shows a situation where $c_1$ and $c_3$ are synonymous (they
have the same shape in the figure). Therefore, for k = 3 we have only two
distinct compounds (since two of the cases are synonyms); clearly, we cannot
treat this situation as identical to that of Fig. 5.a. Notice that, since
$c_1$ and $c_3$ are synonyms, we have two similarity values ($s_1$ and $s_3$)
for the same compound. How can we now decide whether P is positive or
negative?

Let us now consider the synonymy relation ($\equiv$) among the set of
retrieved cases $A = A^+ \cup A^-$. Assume, for instance, that $A^+$ (or
equivalently $A^-$) has a pair of synonymous cases $c \equiv c'$. We can build
a reduced retrieval set $\bar{A}^+$ without synonyms simply by selecting one
of the synonyms and discarding the other; i.e. we could take as the reduced
retrieval set either $\bar{A}^+ = A^+ \setminus \{c\}$ or
$\bar{A}^+ = A^+ \setminus \{c'\}$. Now we introduce two techniques,
Shaud-MI$_{max}$ and Shaud-MI$_{av}$, to deal with multiple-instances using
reduced retrieval sets $\bar{A}^+$ and $\bar{A}^-$.

The technique Shaud-MI$_{max}$ selects the synonymous case in the retrieval
set $A^+$ (resp. $A^-$) with the greatest similarity value and discards the
others. For instance, if $c \equiv c'$ and they have similarity values $s$ and
$s'$ respectively, then if $s > s'$ the case $c$ is selected and thus the
reduced retrieval set is $A^+ \setminus \{c'\}$. Let us call the synonymous
case $\bar{c}$ with maximal similarity value the canonical representative of a
collection of synonyms $c_1 \equiv c_2 \equiv \ldots \equiv c_m$, and let
$\bar{s} = \max(s_1, s_2, \ldots, s_m)$ be its similarity. Clearly, if a case
$c$ has no synonyms in $A^+$ then $\bar{c} = c$ and $\bar{s} = s$. We define
the reduced retrieval set $\bar{A}^+$ as the collection of canonical cases of
$A^+$. The same process is used for obtaining $\bar{A}^-$ from $A^-$. Finally,
the solution class is computed by modifying the CSA criterion as follows:

$$sim^+ = \frac{1}{|\bar{A}^+|} \sum_{\bar{c}_i \in \bar{A}^+} \bar{s}_i \quad \text{and} \quad sim^- = \frac{1}{|\bar{A}^-|} \sum_{\bar{c}_i \in \bar{A}^-} \bar{s}_i \qquad (1)$$

and the same CSA decision rule ($sim^+ > sim^-$ for the positive class) is
used as before.

For instance, in the situation shown in Fig. 5.b, if the synonyms $c_1$ and
$c_3$ belong to the positive class, then $\bar{A}^+ = \{\bar{c}_1\}$,
$|\bar{A}^+| = 1$, and $\bar{s}_1 = \max(s_1, s_3) = s_1$. Analogously, if
$c_2$ belongs to the negative class we have $\bar{s}_2 = s_2$ and, following
the CSA decision rule, P will be classified as positive since
$\bar{s}_1 > \bar{s}_2$ and thus $sim^+ > sim^-$.

The technique Shaud-MI$_{av}$ is similar to the previous one except that it
uses an average criterion instead of the maximum criterion. Thus, for any
collection of synonyms $c_1 \equiv c_2 \equiv \ldots \equiv c_m$ in a
retrieval set, their average similarity
$\bar{s} = \frac{1}{m}(s_1 + s_2 + \ldots + s_m)$ is computed. Let the
canonical synonymous case $\bar{c}$ be a randomly chosen case from a set of
synonymous cases $c_1 \equiv c_2 \equiv \ldots \equiv c_m$. As before, if $c$
has no synonyms in $A^+$ then $\bar{c} = c$ and $\bar{s} = s$. Let $\bar{A}^+$
be the reduced retrieval set with the canonical cases of $A^+$, and for each
$\bar{c}_i \in \bar{A}^+$ let $\bar{s}_i$ be the average synonymous similarity
computed as indicated above; then the CSA average similarity is again computed
as in expression (1), with the same decision rule as before.

For instance, in the situation shown in Fig. 5.b, if the synonyms $c_1$ and
$c_3$ belong to the positive class, then $\bar{A}^+ = \{\bar{c}_1\}$ (i.e.
$|\bar{A}^+| = 1$), and $\bar{s}_1 = \frac{1}{2}(s_1 + s_3)$.

Following the CSA decision rule, P will be classified as positive when
$sim^+ > sim^-$ and negative otherwise.
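
Both techniques differ only in how the similarities of a group of synonyms are
aggregated before the CSA averages are taken, which the following sketch makes
explicit. It is an illustration under stated assumptions: synonyms are
recognized by a shared compound identifier, an empty retrieval set contributes
an average of 0, and the identifiers and similarity values are made up.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# (compound id, similarity to P, is_positive); equal ids mark synonyms.
Case = Tuple[str, float, bool]


def reduced_similarities(cases: Sequence[Case],
                         aggregate: Callable[[List[float]], float]) -> List[float]:
    """One aggregated similarity per distinct compound (reduced retrieval set)."""
    groups = defaultdict(list)
    for compound_id, similarity, _ in cases:
        groups[compound_id].append(similarity)
    return [aggregate(sims) for sims in groups.values()]


def mi_is_positive(cases: Sequence[Case],
                   aggregate: Callable[[List[float]], float]) -> bool:
    """CSA decision on reduced retrieval sets; max gives MI_max, mean gives MI_av."""
    pos = reduced_similarities([c for c in cases if c[2]], aggregate)
    neg = reduced_similarities([c for c in cases if not c[2]], aggregate)
    sim_pos = mean(pos) if pos else 0.0   # assumption: empty set averages to 0
    sim_neg = mean(neg) if neg else 0.0
    return sim_pos > sim_neg


# Situation of Fig. 5.b: c1 and c3 are synonyms (same id) in the positive class.
fig_5b = [("X", 0.9, True), ("Y", 0.8, False), ("X", 0.5, True)]
print(mi_is_positive(fig_5b, max))    # Shaud-MI_max
print(mi_is_positive(fig_5b, mean))   # Shaud-MI_av
```

With these made-up values the maximum criterion classifies P as positive
(0.9 against 0.8), while the average criterion classifies it as negative
(0.7 against 0.8), so the two techniques can disagree on the same retrieval
set.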
Table 1. Distribution of the NTP compounds on the four data sets

data set   Positive   Negative   Equivocal   Inadequate   Unknown
MR           127        176         39           6           12
FR           101        205         35           7           12
MM           102        195         37          13           13
FM           124        198         19           7           12


Table 2. Accuracy results in the four toxicology data sets for Shaud similarity
with three aggregation criteria: CSA, MI$_{max}$, and MI$_{av}$

                        MR                  FR                  MM                  FM
Shaud        k    Acc    TP    FP     Acc    TP    FP     Acc    TP    FP     Acc    TP    FP
CSA          3   54.43  .522  .431   61.77  .463  .319   58.47  .428  .329   56.16  .438  .368
             5   54.66  .560  .466   58.63  .520  .373   58.83  .491  .353   57.97  .512  .377
MI$_{max}$   3   58.37  .517  .362   64.86  .461  .257   59.42  .403  .315   57.21  .445  .346
             5   59.28  .515  .343   64.54  .498  .285   57.62  .443  .352   56.34  .496  .394
MI$_{av}$    3   57.39  .505  .363   64.73  .458  .256   59.26  .439  .302   57.25  .474  .362
             5   58.15  .549  .355   63.85  .466  .274   56.05  .423  .372   56.47  .483  .383



4 Experiments 

In our experiments we used the toxicology data set provided by the NTP. This
data set contains around 500 chemical compounds that may be carcinogenic for
both sexes of two rodent species: rats and mice. The carcinogenic activity of
the compounds has proved to be different for the two species and also for the
two sexes. Therefore, there are in fact four data sets.

We treat the predictive toxicology problem as a classification problem, i.e.
for each data set we try to classify the compounds as belonging either to the
positive class (carcinogenic compounds) or to the negative class
(non-carcinogenic compounds). We used 360 compounds of the data set (those
organic compounds whose structure is available in the NTP reports),
distributed in the classes as shown in Table 1.

The experiments have been performed with the k-NN algorithm using Shaud
as similarity measure, together with the Shaud-MI$_{av}$ and Shaud-MI$_{max}$
techniques explained in the previous section. Results are the mean of seven
10-fold cross-validation trials. Table 2 shows these results in terms of
accuracy, true positives (TP) and false positives (FP) for both options, and
compares them with the version of CSA without multiple-instances. Concerning
accuracy, with k = 3 the versions with multiple-instances improve on the
version without multiple-instances, especially on the MR and FR data sets.
Nevertheless, with k = 5 the versions with multiple-instances are better on
rats (i.e. MR and FR), but the accuracy does not improve on mice (i.e. MM and
FM). We are currently analyzing why the prediction task is more difficult on
mice than on rats.

Currently machine learning methods are evaluated using ROC curves [13]. A
ROC curve is a plot of points (FP, TP), where TP is the ratio between the positive
cases correctly classified and the total number of positive cases, and FP is the
ratio between the negative cases incorrectly classified and the total number of
negative cases. The line x = y represents the strategy of randomly guessing the
class, and the point (0, 1) represents perfect classification. Points above the
diagonal are preferred since they represent a higher rate of TP than FP. Thus, a
point is better than another if its TP is higher and its FP is lower. Moreover,
given two points (FP_1, TP_1) and (FP_2, TP_2) such that FP_1 < FP_2 and
TP_1 < TP_2, the performance of the two methods is incomparable, and the cost of
false positives has to be taken into account in order to choose between them.
The convex hull of a set of points is the smallest convex set that includes the
points. Provost and Fawcett [18] introduced the notion of convex hull in ROC
space as a way to compare machine learning methods. They prove that (FP, TP)
points on the convex hull correspond to optimal methods, whereas points under
the convex hull can be omitted since they never reach optimal performance.
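
The comparison described above can be made concrete with a short sketch. It
is only an illustration under stated assumptions: the function names and the
example points are hypothetical and are not taken from the PTC results or from
Table 2. The hull is the upper convex hull of the (FP, TP) points, anchored at
the trivial classifiers (0, 0) and (1, 1).

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (FP rate, TP rate)


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive means a left turn at a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points: List[Point]) -> List[Point]:
    """Upper convex hull of ROC points, anchored at (0, 0) and (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull: List[Point] = []
    for p in pts:
        # Remove the last hull point while it lies on or below the
        # segment joining its predecessor to the new point p.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def dominates(a: Point, b: Point) -> bool:
    """True if a has no more FP and no fewer TP than b, and a differs from b."""
    return a != b and a[0] <= b[0] and a[1] >= b[1]


# Hypothetical (FP, TP) points for four methods.
methods = [(0.1, 0.4), (0.3, 0.5), (0.35, 0.8), (0.6, 0.7)]
hull = roc_convex_hull(methods)
```

In this example the points (0.3, 0.5) and (0.6, 0.7) fall under the hull,
while (0.1, 0.4) and (0.35, 0.8) lie on it; the latter two are incomparable in
the sense used above, because neither dominates the other.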

We will use the ROC convex hull to compare Shaud-CSA, Shaud-MI_max and
Shaud-MI_av to the best methods of the PTC. According to the final conclusions
of the PTC [20], the best methods for each data set are the following:

— MR. Gonzalez [14] 

— FR. Kwansei [17], Viniti [6] 

— MM. Baurin [5], Viniti, Leuven [7] 

— FM. Viniti, Smuc (from [20]) 

Figures 6 and 7 show the ROC points of the methods above for all the data sets. 
We included in these figures the points corresponding to the Shaud versions. 

4.1 Discussion 

Concerning the MR data set, the methods of Kwansei (3), Gonzalez (2) and
Viniti (6) are on the convex hull of the PTC, so they are the best methods for
this data set (if we do not take the cost into account). Shaud-CSA (7 and 8),
Shaud-MI_max (9 and 10) and Shaud-MI_av (11 and 12) are above the convex
hull (for both k = 3 and k = 5). Taking the points separately, we see that
Shaud-MI_max (10) and Shaud-MI_av (12), both with k = 5, clearly improve on the
performance of the Viniti and Smuc (5) methods in the central zone. From our
point of view, Shaud-MI_max is incomparable with the Gonzalez and Kwansei
methods, since it increases the number of TP but also increases the number of
FP. Therefore, choosing between these methods will depend on the cost assigned
to the FP.

The best methods of the PTC for the FR data set are Viniti (6) and Kwansei
(3). With respect to the convex hull, our methods do not perform very well,
but looking at the points separately we consider that Viniti and Kwansei are
incomparable: Viniti produces few FP but also few TP, whereas Kwansei
clearly produces more TP. Choosing between these two methods depends on the
cost of the FP and also on the need to detect as many TP as possible. In
this sense, our methods are close to Kwansei. In particular, Shaud-MI_av with
both k = 3 and k = 5 is the best approach.

Fig. 6. ROC curves of 12 methods for MR and FR data sets

Fig. 7. ROC curves of 12 methods for MM and FM data sets



Concerning the MM data set, the best methods of the PTC are Viniti (6)
and Leuven (4). The Viniti method is really excellent because the number of FP
is low and the number of TP is high enough. The Leuven method produces more TP,
although its number of FP is also very high. Our methods are in an intermediate
position, near the Baurin (1) method. In particular, any of the
multiple-instance versions with any k has a number of TP near that of Baurin
but with fewer FP. All versions with k = 3 improve on the Baurin method, whereas
CSA without multiple-instances (8) and Shaud-MI_max (10) with k = 5 have higher
TP but also higher FP. Shaud-MI_av has approximately the same number of TP but
a higher number of FP.

Finally, concerning the FM data set, the Viniti (6) and Leuven (4) methods are
on the convex hull. Nevertheless, we consider that the Leuven method is not so
good since it is near the (1,1) point. Our methods are near Baurin (1),
Kwansei (3) and Smuc (5). The Smuc method is better than Kwansei since both
have approximately the same number of TP but Kwansei produces more FP. We
consider that all our methods improve on the Baurin method since their (FP, TP)
points are on the left-hand side of Baurin (i.e. the number of FP is lower) and
all the versions with k = 5 produce more TP. The choice between any of our
methods and Viniti or Smuc clearly depends on the cost of the FP.

Summarizing, establishing a cost measure is necessary in order to choose
meaningfully the adequate methods for each data set. Nevertheless, our lazy
approach using multiple-instances has proved to be competitive enough. A final
remark is that most of the best methods use a lot of information about the
domain. Moreover, methods based on the SAR representation produce toxicity
models in terms of molecular features that are sometimes not easy to determine.
The Viniti method uses a domain representation that takes advantage of the
molecular structure; nevertheless, this representation and also the toxicity
model are difficult to understand. Instead, we used a representation close to
the chemical nomenclature. In this representation we have only taken into
account the molecular structure, without any additional feature. Our conclusion
is that having only structural information is enough to obtain a comparable
performance, and that it is not necessary to handle features that are neither
intuitive nor easy to compute.



5 Related Work 

The notion of multiple-instances is useful when domain objects can be viewed 
in several ways. Specifically, Dietterich et al. [11] used multiple-instances for 
determining the activity of a molecule, taking into account that a molecule 
has different isomers with different activity. As explained in section 2, chemical 
nomenclature allows synonym names for one compound. We intend to use the 
notion of multiple-instances to manage synonymous descriptions of compounds. 

The basic idea of multiple-instances is that a domain object can be represented
in several alternative ways. Chemistry is an application domain where
multiple-instances can be applied in a natural way, since the molecular
structure of a compound has several possible configurations with different
properties (e.g. one configuration may be active whereas another is inactive).
Most authors working on multiple-instances use chemical domains such as
mutagenesis [19] and musk (from the UCI repository). Dietterich et al. [11]
introduced the notion of multiple-instance and extended the axis-parallel
rectangle method to deal with it. Other authors then proposed extensions of
some known algorithms in order to deal with multiple-instances.

Chevaleyre and Zucker [8] proposed an extension of propositional rule learning.
Specifically, they proposed two extensions of the RIPPER method [9]:
NAIVE-RIPPERMI, which is a direct extension, and RIPPERMI, which performs
relational learning. Maron and Lozano-Perez [16] introduced a probabilistic
measure, called diverse density, that computes the intersection of the positive
bags (sets of synonymous objects) minus the union of the negative bags. By
maximizing this measure they reach the goal concept. Zucker [22] introduced the
multi-part problem, meaning that an object can be represented by means of a set
of descriptions of its parts, and proposed extensions of the classical concept
learning algorithm by defining a multiple entropy function and a multiple
coverage function.

There are also some approaches using multiple-instances with a lazy learning
approach. Wang and Zucker use the k-NN algorithm with the Hausdorff distance
[12], defined to assess the distance between two bags. They introduce two
versions of k-NN: Bayesian k-NN, which uses a Bayesian model to vote the final
classification; and citation k-NN, where the different bags are related in the
same way as references in information science.
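
As an illustration of a distance between bags, the following sketch computes
the standard (maximal) Hausdorff distance between two bags of instances. It is
not taken from the cited work, whose variants may differ; the function names
and the one-dimensional example bags are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def directed_hausdorff(a: Sequence[T], b: Sequence[T],
                       dist: Callable[[T, T], float]) -> float:
    """Largest distance from an instance of bag a to its nearest instance in b."""
    return max(min(dist(x, y) for y in b) for x in a)


def hausdorff(a: Sequence[T], b: Sequence[T],
              dist: Callable[[T, T], float]) -> float:
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a, b, dist), directed_hausdorff(b, a, dist))


# Hypothetical bags of one-dimensional instances, compared by absolute difference.
bag_a = [0.0, 1.0]
bag_b = [0.5, 3.0]
d = hausdorff(bag_a, bag_b, lambda x, y: abs(x - y))
```

Here every instance of bag_a is within 0.5 of bag_b, but the instance 3.0 of
bag_b is 2.0 away from its nearest neighbour in bag_a, so the distance between
the two bags is determined by that outlying instance.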



6 Conclusions 

In previous work we have shown that using both a chemical ontology based
representation of the compounds and a lazy learning approach is feasible for
the predictive toxicology problem. However, our approach was limited by the
fact that the ontology we were using (namely the chemical nomenclature) allows
multiple descriptions of single compounds. Since the PTC data set we were using
only used one description for each compound, the selection of that description
over the other ones introduced an unwanted and unknown bias. In fact, the
Shaud similarity compared two compound descriptions but not the alternative
descriptions that were not included in the data set; therefore results could be
different if the selected descriptions were different.

Therefore, our purpose as explained in this paper was to use multiple
descriptions when meaningful. However, it was not enough to expand the PTC data
set to allow every example to have several compound descriptions: we needed to
define how multiple compound descriptions would be interpreted by lazy learning
methods. In this paper we have introduced the notion of reduced retrieval
sets to integrate Dietterich’s notion of multiple-instances into k-NN techniques.
Specifically, we consider that k-NN retrieves k cases similar to a problem P and,
for each class to which P can be assigned, a retrieval set can be built from the
retrieved cases of that class.

We presented two methods for dealing with multiple-instances, Shaud-MI_max
and Shaud-MI_av, that specify how reduced retrieval sets are built from
classical k-NN retrieval sets. This building process is, in fact, a specification
of how to interpret the fact that more than one description of a specific
compound is in the k-NN retrieval sets. Since Shaud-MI_max uses the maximal
similarity among synonyms, the interpretation is that we only take into account
the synonymous description that is the most similar, disregarding the others.
Nevertheless, multiple-instances are useful since they allow more similar
matches to be found in the k-NN retrieval process.

On the other hand, Shaud-MI_av uses the average similarity among retrieved
synonyms, thus in some way penalizing multiple-instances that have a second
most similar compound description with a lower similarity value. Recall that
both techniques normalize the aggregate similarities of the retrieval sets by
the number of retrieved examples (i.e. not counting synonyms), and therefore a
k-NN retrieval set contains k cases that represent different chemical compounds
(as it would without multiple-instances, which is exactly what Shaud-CSA
does).
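
The difference between the two aggregation choices can be sketched as follows.
This is a minimal illustration and not the implementation used in the
experiments: the data layout, the function name and the exact normalisation
(a per-class mean over distinct compounds) are assumptions made for the
example, and the similarity values are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

# (compound id, class label, similarity of one description to the problem P)
Retrieved = Tuple[str, str, float]


def reduced_retrieval_scores(
    retrieved: Sequence[Retrieved],
    aggregate: Callable[[List[float]], float],
) -> Dict[str, float]:
    """Collapse synonymous descriptions per compound, then score each class."""
    by_compound: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for compound, label, similarity in retrieved:
        by_compound[(compound, label)].append(similarity)

    per_class: Dict[str, List[float]] = defaultdict(list)
    for (_, label), sims in by_compound.items():
        # One value per compound: max (MI_max style) or mean (MI_av style).
        per_class[label].append(aggregate(sims))

    # Assumed normalisation: divide by the number of distinct compounds.
    return {label: sum(vals) / len(vals) for label, vals in per_class.items()}


# Hypothetical retrieval: compound c1 appears with two synonymous descriptions.
retrieved = [
    ("c1", "pos", 0.9),
    ("c1", "pos", 0.5),
    ("c2", "neg", 0.8),
    ("c3", "pos", 0.6),
]
scores_max = reduced_retrieval_scores(retrieved, max)
scores_av = reduced_retrieval_scores(retrieved, mean)
```

With the maximum, the weaker synonym of c1 is ignored; with the average, it
lowers the contribution of c1 and hence the score of the positive class, which
is the penalising effect described above.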

The experiments have shown that introducing multiple-instances improves
the performance of lazy learning in general terms. Specifically, using
multiple-instances improves results on the rat data sets (both male and
female), while on the mouse data sets using multiple-instances or not gives
incomparable results (a cost measure would be needed to decide the best among
them). Notice also that our lazy learning techniques are more competitive, when
compared with other ML methods, on the rat data sets, while they are not
distinguishable from other methods on the mouse data sets. Although the reasons
for the differences in performance of lazy learning (and of the other ML
methods) on the PTC are not well understood (see [15]), it seems that
multiple-instances can be useful in situations where a lazy learning method is
adequate, as in the data sets for male and female rats.

The representation of the chemical compounds in our experiments uses only
structural information. Instead, representations based on SAR use features
whose computation is imprecise and which are not totally comprehensible to
the expert. In the future, we plan to extend our representation to introduce
information about short-term experiments, in particular the Ames test [1], whose
result has proved to be very important and is easy to obtain.

Abstract. This paper describes the problem of modelling the toxicity of
environmental pollutants using molecular descriptors from a systems-theoretical
viewpoint. It is shown that current toxicity modelling problems systematically
incorporate very high levels of noise a priori. By means of a set of individual
and combined models self-organised by KnowledgeMiner from a high-dimensional
molecular descriptor data set calculated within the DEMETRA project, we suggest
a way in which the interpretation of results and final decision making can
effectively take into account the huge uncertainty of toxicity models.



1 Introduction 

The global production of chemicals has increased from 1 million tonnes in 1930 to
400 million tonnes today. There are about 100,000 different substances registered in
the EU market, of which 10,000 are marketed in volumes of more than 10 tonnes and
a further 20,000 are marketed at 1-10 tonnes.

Besides the economic importance of the chemical industry as Europe’s third
largest manufacturing industry, it is also true that certain chemicals have
caused serious damage to human health, resulting in suffering and premature
death, and to the environment. The incidence of some diseases, e.g. testicular
cancer in young men and allergies, has increased significantly over the last
decades. While the underlying reasons for this have not yet been identified,
there is justified concern that certain chemicals play a causative role for
allergies.

The present system for general industrial chemicals distinguishes between
"existing substances", i.e. all chemicals declared to be on the market in
September 1981, and "new substances", i.e. those placed on the market since that
date. There are some 2,700 new substances. Testing and assessing their risks to
human health and the environment according to Directive 67/548 is required
before marketing in volumes above 10 kg. In contrast, existing substances
amount to more than 99% of the total volume of all substances on the market,
and are not subject to the same testing requirements. As a result, there is a
general lack of knowledge about the properties and the uses of existing
substances. The risk assessment process is slow and resource-intensive and
does not allow the system to work efficiently and effectively [1].

To address these problems and to achieve the overriding goal of sustainable
development, one political objective formulated by the European Commission in
its White Paper [1] is the promotion of non-animal testing, which includes:

• Maximising use of non-animal test methods; 

• Encouraging development of new non-animal test methods; 

• Minimising test programmes. 

A current way in that direction is building mathematical Quantitative
Structure-Activity Relationship (QSAR) models based on existing test data that
aim at describing and predicting the short-term, acute impact of a chemical
compound (pollutant) on the health of a population of a certain biological
species. This impact can either be direct, by injection or feeding, or
indirect, by introducing a specific concentration of a chemical into the
environment (air, water, soil). To express the chemical’s impact on the
population’s health, the lethal dose LD50 or the lethal concentration LC50
(toxicity) is measured correspondingly. LC50, for example, specifies the
observed concentration of a chemical compound at which 50% of the population
died within a given time after introduction of the chemical to the system.

In this work the Group Method of Data Handling (GMDH) [2] is used as a very ef- 
fective and valuable modelling technology for building mathematical models and 
predictions of the lethal concentration. 



2 The Problem of Modelling Toxicity 

2.1 Systems Analysis 

Generally, real-world systems are time-variant nonlinear dynamic systems [3]. There- 
fore, it should be useful to allow the modelling algorithm to generate systems of 
nonlinear difference equations. For toxicity modelling this system can be considered 
time-invariant due to the intentionally short-term effect of the pollutant. 

A possible dynamic model of the ecotoxicological system is shown in figure 1, 




where

x(t) - state vector of the ecological system at time t,
u(t) - vector of external variables at time t,
c_v(t) - concentration of the pollutant v at time t,
z_1(t), z_2(t) - external disturbances to the system at time t,
y(t) - output vector of dimension p describing the health of the population at time t,
y(t) = [y_1(t), y_2(t), ..., y_m(t), ..., y_p(t)]^T,
y_m(t) - the population’s cumulated mortality rate at time t (see also fig. 3).

This dynamic model is described by the following system of equations:

\[
\begin{aligned}
x(t+1) &= G\bigl(x(t), u(t), c_v(t), z_1(t), z_2(t)\bigr) \\
w(t) &= H_1\bigl(x(t), u(t), c_v(t), z_1(t)\bigr) \\
y(t) &= H_2\bigl(w(t), z_2(t)\bigr) = H^*\bigl(x(t), u(t), c_v(t), z_1(t), z_2(t)\bigr)
\end{aligned}
\tag{1}
\]

with

\[
c_v(t) = \begin{cases} c_0, & t = t_0 \\ 0, & \text{else} \end{cases}
\]

and c_0 as the concentration of the test compound v in mg/l.



During the animal tests, however, the external variables u(t) and the state
variables x(t) of the system are usually not observed, or not observable, and
are therefore considered constant, so that for modelling the ecotoxicological
system transforms into a nonlinear static system (fig. 2):




Fig. 2. Reduced model of the static system with noise z_G = h_1(z_1, z_2)

Additional noise z_3 is introduced to the static system by the missing
information on external and state variables, which now transforms into noise.
The testing procedure itself also adds some noise z_4, so that the static
system’s noise finally is z_S = h_2(z_G, z_3, z_4), and the modelling task of the
ecotoxicological system reduces to approximating the dependence of the observed
mortality rate y on the pollutant’s concentration c_v:

\[
y = f_1(c_v, z_S). \tag{2}
\]

If an animal experiment is repeated several times using the same concentration
c_iv of a chemical test compound v, multiple observed mortality rate values
y_{c_iv} are available (fig. 3). This means that, for c_iv = const., the interval
of the observed mortality rate values y_{c_iv} can be seen as a direct expression
of the static system’s noise z_S. For the reverse case of measuring the
concentration c_v for a constant mortality rate y_j = const., the problem
transforms to

\[
c_v = f_2(y_j, z_S) \quad \text{(fig. 3)}. \tag{3}
\]

For y_j = 50%, c_v is the observed lethal concentration LC50 for a pollutant v,
which is actually used as the output variable in toxicity QSAR modelling. With a
commonly observed ratio c_{v,max} / c_{v,min} = 4 for a single compound v, this
output variable can be seen as highly noisy.
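
To give a feeling for the size of this noise, the following sketch converts a
max/min ratio of repeated LC50 measurements into an interval width on a
logarithmic scale. The measurement values and the function name are
hypothetical, and expressing the spread in log10 units is an assumption made
for the illustration only.

```python
import math
from typing import Sequence


def spread_ratio(lc50_values: Sequence[float]) -> float:
    """Ratio between the largest and smallest LC50 observed for one compound."""
    return max(lc50_values) / min(lc50_values)


# Hypothetical repeated LC50 measurements (mg/l) for a single compound.
repeated_tests = [0.5, 0.8, 1.3, 2.0]
ratio = spread_ratio(repeated_tests)   # 2.0 / 0.5
log_spread = math.log10(ratio)         # interval width in log10 units
```

A ratio of 4 corresponds to an interval of about 0.6 log10 units, so a model
cannot be expected to reproduce a single reported LC50 value more precisely
than this interval allows.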

Fig. 3. Variation of LC50 resulting from a number of comparable tests

The initial task of modelling the observed mortality rate y from a pollutant’s
concentration c_v now shifts to finding a description of the dependence of a
pollutant’s lethal concentration LC50 for a specific species on the pollutant’s
molecular structure s_v (fig. 4):

\[
LC_{50} = f_3(s_v, z_M), \quad \text{with } z_M = h_3(z_S). \tag{4}
\]




Fig. 4. The toxicity modelling problem as applied in practice. Note that the input variable c_v
(LC50) of the initial ecotoxicological system (fig. 1 and 2) has shifted to now being the
objective of modelling

This finally means not modelling the object itself - the ecotoxicological system - but
one of its inputs - the external disturbance c_v. The initial system’s input-output
relation is mapped by just a single pair of observations (LC50, y), so that it is
described by a linear relationship a priori.

A further problem is how to express the structure s_v of the chemical v. Commonly, it
is a complex chemical object, but for building a mathematical model that describes
the dependence of the toxicity on the chemical structure, a formal transformation
into a set of numerical properties - descriptors - is required. This transformation is
based on chemical and/or biological domain knowledge implemented in some
software (fig. 5):

\[
d_v = f_4(s_v, z_T). \tag{5}
\]

Fig. 5. Model of the chemical structure to molecular descriptor transformation

In the chemical domain, for example, the input of the software system can be a
2-dimensional or a 3-dimensional drawing of the chemical structure, but SMILES
coded strings or other expressions are also possible. The output of the system is
a certain set of molecular descriptors depending on the software used and the
theoretical model implemented. Applying different software provides different
sets of descriptors that may intersect to some extent but do not necessarily
have identical values. Also, the interpretational power of descriptors can be
low when they lose chemical meaning.

The process of descriptor calculation also adds noise. Not only software bugs or
manual failures may introduce noise; more important sources of uncertainty are
likely the latitude in interpreting domain knowledge when formalising an
appropriate set of molecular descriptors, different assumptions about starting
conditions (conformation) for descriptor calculation, and several different
optimisation options. The chemical meaning of descriptors is not always strong
or theoretically well founded.

The final, simplified nonlinear static model used in QSAR modelling to describe
acute toxicity is (fig. 6):




Fig. 6. Simplified model for describing acute toxicity

with

\[
LC_{50} = f_3\bigl(s_v(d_v, z_T), z_M\bigr) = f(d_v, z_T, z_M), \tag{6}
\]

LC50 - observed lethal concentration for a certain species and chemical compound,
s_v - the structure of the tested chemical compound in the chemical domain,
z_T - noise of the chemical structure to molecular descriptor transformation process,
z_M - noise transformed from the ecotoxicological test system,
d_v - vector of numerical molecular descriptors of the test compound.

The external disturbance z_T, which adds noise to the descriptor input space used for
modelling, can be reduced by fixing bugs and manual failures and by finding the most
consistent chemical structure to descriptor transformation - although it is not clear a
priori which transformation or optimisation will add noise and which will reduce it. The
disturbance z_M, which finally results from the experimental tests, in contrast, adds
noise to the output LC50 and is a given fact that cannot be changed afterwards.



2.2 Modelling Methods 

Apparently, toxicity QSAR modelling implies dealing with very noisy data. Data sets
are generally not perfect reflections of the world. The measuring process necessarily
captures uncertainty, distortion and noise. Noise is not an error that infects the data
but a part of the world. Therefore, a modelling tool, but also results and decisions,
must deal with the noise in the data. Information about the noise dispersion can be
useful for choosing adequate modelling technologies by referring to Stafford
Beer’s adequacy law [4]: The “black boxes” of the objects have to be compensated by
corresponding “black boxes” in the information or control algorithm. Based on this
idea, the following general classification of modelling algorithms is suggested in [2].
For a small level of noise dispersion, all regression-based methods using some
internal criterion can be applied:

• GMDH with internal selection criteria, 

• Statistical methods, or 

• Neural Networks. 

For considerably noisy data - which always includes small data samples - GMDH
or other algorithms based on external criteria are preferable. Finally, for a high
level of noise dispersion, i.e. processes that show a highly random or chaotic
behavior, nonparametric algorithms of clustering, Analog Complexing, or fuzzy
modelling should be applied to satisfy the adequacy law. This also implies that
with increasing noise in the data the model results and their descriptive language
become fuzzier and more qualitative.

There is a broad spectrum of possible algorithms to use, because it is not possible
to define the characteristics of the controlled object exactly in advance. Therefore, it
is helpful to try several modelling algorithms first, and then decide which algorithms
suit the given type of object best, or combine the results of different modelling
runs most appropriately in a hybrid model. In QSAR modelling, for several reasons,
algorithms for modelling linear static systems have predominantly been used (linear
regression and PLS, especially), which is an additional significant simplification of the
highly disturbed ecotoxicological system model. One reason surely is connected with
problems in creating and validating reliable descriptive and predictive nonlinear
models. Even in cases where it was possible to create reasonably good predictive
nonlinear models (Neural Networks) - not looking at the special validation
requirements of nonlinear models in general - they commonly have no or only low
descriptive power, which, however, turns out to be an important feature for
applicability and acceptability in real-world scenarios. Users usually do not want to
base decisions on some kind of “black box”. Due to the large noise level in toxicity
modelling, descriptive power might also be part of the model evaluation procedure,
because models that can be interpreted from a theoretical viewpoint can be judged
using domain knowledge. Another reason for preferring linear models in toxicity QSAR
modelling is the high-dimensional descriptor space and/or the comparatively low
number of tested compounds, which always implies state space dimension reduction.
Linear approaches are widely used here in preprocessing to obtain a small set of
“best” descriptors, where “best” then relates to building linear models.

2.3 Modelling Technologies Used 

2.3.1 High-Dimensional Modelling 

A new approach to high-dimensional state space modelling that we have been
developing and using is based on multileveled self-organisation. The basic idea here
is to divide high-dimensional modelling problems into smaller, more manageable
problems by creating a new self-organising network level composed of active neurons,
where an active neuron is represented by an inductive learning algorithm (lower levels
of self-organisation) applied to disjoint data sets. This approach is based on
the principle of regularisation of ill-posed tasks, especially the requirement of
defining the actual task of modelling a priori to be able to select a set of best
models. In the context of knowledge discovery from databases, however, this also
implies using this principle consistently in every stage of the knowledge extraction
process - data preselection, preprocessing including dimension reduction, modelling
(data mining), and model evaluation. The proposed approach of multileveled
self-organisation integrates preprocessing, modelling, and model evaluation into a
single, automatically running process, and it therefore allows for directly and
objectively building reliable models from high-dimensional data sets (up to 30,000
variables). The external information necessary to run the new level of
self-organisation is provided by the corresponding algorithm’s noise sensitivity
characteristic as explained in [5, 6].

2.3.2 Inductive Learning Algorithm 

The inductive learning algorithm we used in this work in the network’s active neurons
is the Group Method of Data Handling (GMDH), described in more detail in [2].
The theory of GMDH Neural Networks was first developed by A.G. Ivakhnenko [7,
8] in 1968, based on Statistical Learning Network theory and on the principle of
induction, where induction consists of

• The cybernetic principle of self-organization as an adaptive creation of a network 
without subjective points given; 

• The principle of external complement enabling an objective selection of a model of 
optimal complexity and 

• The principle of regularization of ill-posed tasks. 

This different foundation compared to traditional Backpropagation Neural Networks
allows for the autonomous and systematic creation of optimal complex models by
employing both parameter and structure identification. An optimal complex model is a
model that optimally balances model quality on a given learning data set ("closeness
of fit") and its generalisation power on new, not previously seen data with respect to
the data's noise level and the task of modelling (prediction, classification, modelling,
etc.). It thus solves the basic problem of experimental systems analysis of
systematically avoiding "overfitted" models based on the data's information only. This
makes GMDH a highly automated, fast and very efficient supplement and alternative to
other data mining methods. Also, as a result of modelling, an analytical model in the form of




228 Frank Lemke, Johann-Adolf Muller, and Emilio Benfenati 

algebraic formulas, difference equations, or systems of equations is available on the
fly for interpretation and for gaining insight into the system. In our work, the GMDH
implementation of the KnowledgeMiner software was used exclusively [9].
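
The role of the external complement can be illustrated by a strongly
simplified sketch of a single GMDH selection layer. It is not the
KnowledgeMiner implementation: the function names and the data are
hypothetical, and each candidate neuron is a linear function of two inputs
instead of the quadratic polynomials commonly used in GMDH. Coefficients are
estimated on a learning set A, and the candidates are ranked by their error on
a separate set B.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_neuron(xs: Sequence[Sequence[float]], y: Sequence[float],
               i: int, j: int) -> List[float]:
    """Least-squares fit of y ~ a0 + a1*x_i + a2*x_j via the normal equations."""
    rows = [[1.0, x[i], x[j]] for x in xs]
    ata = [[sum(r[p] * r[q] for r in rows) for q in range(3)] for p in range(3)]
    aty = [sum(r[p] * t for r, t in zip(rows, y)) for p in range(3)]
    return solve(ata, aty)


def predict(coef: Sequence[float], x: Sequence[float], i: int, j: int) -> float:
    return coef[0] + coef[1] * x[i] + coef[2] * x[j]


def gmdh_layer(xa, ya, xb, yb, keep: int = 2
               ) -> List[Tuple[float, Tuple[int, int], List[float]]]:
    """Fit one neuron per input pair on set A, rank by mean squared error on set B."""
    scored = []
    for i, j in combinations(range(len(xa[0])), 2):
        coef = fit_neuron(xa, ya, i, j)
        err = sum((predict(coef, x, i, j) - t) ** 2 for x, t in zip(xb, yb)) / len(yb)
        scored.append((err, (i, j), coef))
    scored.sort(key=lambda s: s[0])
    return scored[:keep]


# Hypothetical data generated from y = 1 + 2*x0 - x2; input x1 is irrelevant.
xa = [[0, 1, 0], [1, 0, 1], [2, 1, 3], [3, 0, 2], [4, 1, 5], [5, 0, 4]]
ya = [1, 2, 2, 5, 4, 7]
xb = [[6, 1, 6], [7, 0, 9], [8, 1, 7]]
yb = [7, 6, 10]
best_error, best_pair, best_coef = gmdh_layer(xa, ya, xb, yb)[0]
```

In a full GMDH network the outputs of the retained neurons would serve as
inputs of the next layer, and layers would be added until the error on set B
stops decreasing; this stopping rule is what selects a model of optimal
complexity.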

2.3.3 Model Combining 

Another focus is on model combining. There are several reasons to combine models 
or their results [2]: 

1. All kinds of parametric, nonparametric, algebraic, and binary/fuzzy logic models are
only simplified reflections of reality. There are always several models with a
sufficiently similar degree of adequacy for a given data sample. However, every model
is a specific abstraction, a one-sided reflection of only some important features of
reality. A synthesis of alternative model results gives a more thorough reflection.

2. Although models are self-organised, there is still some freedom of choice in several
areas due to the regularisation requirement of ill-posed tasks. This freedom of choice
concerns, for example, the type of model (linear/nonlinear) and the choice of some
modelling settings (threshold values, normalisation, etc.). To reduce this unavoidable
subjectivity, it can be helpful to generate several alternative models and then, in a
third level of self-organisation, to improve the model outputs by synthesising
(combining) all alternative models in a new network.

3. In many fields, such as toxicology, only a small number of observations are
available, which leads to uncertain results. Artificially generating more training
cases, for example by means of jittering or randomisation, is a powerful way to improve
model results here.

4. All methods of automatic model selection lead to a single "best" model, while the
accuracy of the model result depends on the variance of the data. A common way to
reduce variance is to aggregate similar model results by means of resampling and other
methods (bagging, boosting), following the idea: generate many versions of the same
predictor/classifier and combine them.

5. If modelling aims at prediction, it is helpful to use alternative models to estimate
alternative forecasts. These forecasts can be combined using several methods to yield a
composite forecast with a smaller error variance than any of the components has
individually. The desire to get a composite forecast is motivated by the pragmatic
reason of improving decision-making rather than by the scientific one of seeking better
explanatory models. Composite forecasts can provide more informative inputs for a
decision analysis, and therefore they make sense within decision theory, although they
are often unacceptable as scientific models in their own right, because they frequently
represent an agglomeration of often conflicting theories.
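
The variance-reduction argument behind points 4 and 5 can be illustrated with a minimal
Python sketch. It is not the KnowledgeMiner procedure; it assumes three hypothetical
forecasts whose errors are independent and of equal size, and it combines them with
equal weights. Under that assumption the composite error variance is smaller than that
of any component. Errors of real models are usually correlated, so the gain is smaller
in practice.

```python
import random
import statistics


def composite_forecast(forecasts):
    """Equal-weight combination of several forecasts of the same quantity."""
    return sum(forecasts) / len(forecasts)


def simulate(n_cases=2000, n_models=3, noise_sd=1.0, seed=0):
    """Compare error variances of hypothetical component forecasts and their average.

    Assumption: every model's error is independent Gaussian noise with the same
    standard deviation. This is an idealisation for illustration only.
    """
    rng = random.Random(seed)
    component_errors = [[] for _ in range(n_models)]
    composite_errors = []
    for _ in range(n_cases):
        truth = rng.uniform(-6.0, 1.0)  # hypothetical true value
        forecasts = [truth + rng.gauss(0.0, noise_sd) for _ in range(n_models)]
        for errors, forecast in zip(component_errors, forecasts):
            errors.append(forecast - truth)
        composite_errors.append(composite_forecast(forecasts) - truth)
    component_var = [statistics.pvariance(e) for e in component_errors]
    composite_var = statistics.pvariance(composite_errors)
    return component_var, composite_var


component_var, composite_var = simulate()
print("component error variances:", component_var)
print("composite error variance: ", composite_var)
```

With independent errors the composite variance is roughly the component variance
divided by the number of models, which is why aggregation methods such as bagging
rely on generating diverse versions of a predictor.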



3 Results on Modelling Toxicity of Pesticide Residues 

3.1 The Data Set 



We used a data set calculated within the DEMETRA project [10]. It contains 281
chemical compounds - pesticides - and the corresponding experimentally observed lethal
concentrations LC50 for trout. 1061 2D molecular descriptors were calculated by
different commercial or publicly available software. This descriptor set is highly
redundant, so clustering was used to obtain a non-redundant nucleus of 647 potential 2D
descriptors showing a diversity of at least 2%. 46 chemical compounds were held out for
out-of-sample testing (N_C) of the generated models, so that 235 pesticides were used
for modelling (N_AB).



3.2 Individual Models 



A set of 13 different linear and non-linear QSAR models M1 to M13 was self-organised
directly from this data set by the KnowledgeMiner data mining software [9]. The
necessary workflow of accessing data from the database, preprocessing (missing values
detection, data transformation), and modelling (data mining) was automated with
AppleScript, which integrated the various software tools running under Mac OS X
(MS Excel, MS Word, TextEdit, Valentina DB, AppleWorks, KnowledgeMiner).

For each model we calculated three different model performance measures:
Descriptive Power (DP) as described in [5], the Coefficient of Determination (R²),
and the Mean Absolute Percentage Error (MAPE) as follows:

\[ R^2 = 1 - \delta^2, \qquad \delta^2 = \frac{\sum_{i \in N} (y_i - \hat{y}_i)^2}{\sum_{i \in N} (y_i - \bar{y})^2}, \tag{7} \]

\[ \mathrm{MAPE} = \ldots \times 100\%, \tag{8} \]

where y_i, ŷ_i, and ȳ are the true, estimated, and mean values of the output variable,
respectively, and δ² is the Approximation Error Variance criterion [2].
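
As a rough illustration of these measures, the following Python sketch computes R² via
the approximation error variance of eq. (7) and a MAPE value. The MAPE function uses the
conventional definition (mean of |y_i - ŷ_i| / |y_i|, in percent); this is an assumption
and may differ from the exact form the authors used in eq. (8). The function names and
the sample values are hypothetical.

```python
def approximation_error_variance(y_true, y_pred):
    """delta^2: residual sum of squares relative to the total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return ss_res / ss_tot


def r_squared(y_true, y_pred):
    """Coefficient of determination, R^2 = 1 - delta^2."""
    return 1.0 - approximation_error_variance(y_true, y_pred)


def mape(y_true, y_pred):
    """Conventional mean absolute percentage error in percent (assumed definition).

    Undefined when any true value is zero.
    """
    ratios = [abs((y - p) / y) for y, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical lg(LC50) values, for illustration only.
observed = [-4.0, -2.0, -1.0]
predicted = [-3.5, -2.0, -1.5]
print(round(r_squared(observed, predicted), 3))
print(round(mape(observed, predicted), 1))
```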

The corresponding results are listed in table 1 . 



Table 1. Performance parameters for 13 individual models self-organised by KnowledgeMiner

                    Calculated on N_AB           Calculated on N_C     Calculated on N_ABC
MODEL               R²     DP [%]   MAPE [%]     R²     MAPE [%]       R²     MAPE [%]
M1 (linear)         0.69   43       28           0.54   34             0.67   28
M2 (linear)         0.71   44       28           0.42   37             0.66   29
M3 (nonlinear)      0.71   40       26           0.49   34             0.68   28
M4 (nonlinear)      0.74   43       25           0.41   37             0.63   28
M5 (nonlinear)      0.68   40       n.a.         0.31   47             0.62   31
M6 (linear)         0.71   45       26           0.36   40             0.64   30
M7 (linear)         0.71   45       26           0.33   42             0.63   31
M8 (nonlinear)      0.76   47       23           0.30   39             0.66   28
M9 (nonlinear)      0.75   46       24           0.21   43             0.64   29
M10 (linear)        0.70   45       27           0.58   31             0.68   28
M11 (linear)        0.69   44       28           0.54   33             0.66   29
M12 (nonlinear)     0.72   44       26           0.49   33             0.68   28
M13 (nonlinear)     0.76   48       25           0.42   37             0.69   28

3.3 The Combined Model

Finally, a combined model M_comb was generated likewise from the 13 individual models.
The combined model is built on the predicted toxicity values of the individual models
M1 to M13 as input information. To introduce new independent information for this
second model optimization level, all chemical compounds of the initial data set,
including those held out for testing, were used for modelling, so that all 281
compounds formed the learning data set here (N_AB). This is possible and reasonable
because the modelling task is set to work under conditions for which the generalization
power of the external cross-validation selection criterion of the GMDH algorithm [2]
works properly according to the algorithm's noise sensitivity characteristic [5, 6].
Table 2 shows the performance improvements of the combined model.



Table 2. Performance parameters for the combined model

                    Calculated on N_AB
MODEL               R²     DP [%]   MAPE [%]
M_comb (linear)     0.76   50       25



The self-organised model equation for M_comb, y = f(M5, M10, M11, M12, M13), is:

\[ \lg(LC_{50}\,[\mathrm{mmol/l}]) = 0.131 - 0.243\,M_{11} + 0.242\,M_{5} + 0.384\,M_{10} + 0.301\,M_{12} + 0.364\,M_{13} \tag{9} \]

and it is finally composed of 53 different descriptors. 
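
A minimal Python sketch of how such a linear combined model could be evaluated is given
below. The intercept and coefficients are those of eq. (9); the assignment of 0.384 to
M10 and of -0.243 to M11 is read from a poorly legible equation and should be treated as
an assumption. The input values are hypothetical predictions of lg(LC50 [mmol/l]) from
the individual models.

```python
# Coefficients of eq. (9). Assumption: 0.384 belongs to M10 and -0.243 to M11.
INTERCEPT = 0.131
COEFFICIENTS = {
    "M5": 0.242,
    "M10": 0.384,
    "M11": -0.243,
    "M12": 0.301,
    "M13": 0.364,
}


def combined_prediction(individual):
    """Return lg(LC50 [mmol/l]) of the combined model.

    `individual` maps model names to the individual models' predictions.
    """
    return INTERCEPT + sum(c * individual[name] for name, c in COEFFICIENTS.items())


# Hypothetical individual predictions for one compound.
example = {"M5": -2.1, "M10": -1.8, "M11": -2.0, "M12": -1.9, "M13": -2.2}
print(round(combined_prediction(example), 3))
```

The coefficients sum to about 1.05, so the combined model behaves roughly like a
weighted average of the selected individual predictions plus a small offset.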

3.4 Model Uncertainty and Prediction Interval 

As pointed out in section 2, toxicity data are highly noisy and therefore require
adequate methods for modelling and for interpreting results. Additionally, all methods
of automatic model selection lead to a single “best” model. Conclusions and decisions
are then made on this basis as if the model were the true model. However, this ignores
the major component of uncertainty, namely uncertainty about the model itself. In
toxicity modelling, a single crisp prediction value cannot cover and reflect the
uncertainty given by the initial object's data. If models can be obtained in a
comparatively short time, it is useful to create several alternative reliable models on
different data subsets or using different modelling methods and then to span a
prediction interval from the models' various predictions to describe the object's
uncertainty more appropriately. In this way a most likely, a most pessimistic (or
safest), and a most optimistic (or least safe) prediction is obtained, based only on
the models already given. No additional (statistical) model has to be introduced for
confidence interval estimation, for example; such a model would have to make new
assumptions about the predicted data and would therefore carry the confidence in those
assumptions, which is not known a priori.
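
The idea of spanning a prediction interval from alternative models can be sketched in a
few lines of Python. The predictions are hypothetical values of lg(LC50); the lowest
value is taken as the most pessimistic one because a lower LC50 means higher toxicity.
How the "most likely" value is chosen is not specified here, so the median used below
is an assumption.

```python
import statistics


def prediction_interval(predictions):
    """Span a prediction interval from alternative models' lg(LC50) predictions."""
    return {
        "most_pessimistic": min(predictions),  # lowest LC50, i.e. most toxic
        "most_likely": statistics.median(predictions),  # assumption: median
        "most_optimistic": max(predictions),  # highest LC50, i.e. least toxic
    }


# Hypothetical predictions of five alternative models for one compound.
interval = prediction_interval([-2.1, -1.8, -2.6, -2.0, -1.9])
print(interval)
```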
A prediction interval has two implications:

1. The decision maker is provided with a set of predicted values that are possible and
likely representations of a virtual experimental animal test, including the
uncertainty once observed in corresponding past real-world experiments. The decision
maker can base the decision on any value of this interval according to the importance,
reliability, safety, impact or effect, or other properties of the actual decision. This
keeps the principle of freedom of choice for the decision process.

2. Depending on which value is used, a prediction interval also results in different
model quality values, starting from the highest accuracy for the most likely
predictions.

Figure 7 displays the prediction intervals for test set compounds (N_C) from the
models contained in the combined model M_comb reported in 3.3.

[Figure 7. Panel title: Prediction interval test set]




In a real-world application scenario, evaluation and decision-making can only be based
on predictions; usually no experimentally observed toxicity value is given, except
those available from past tests. As a supplement to prediction intervals, the
following approach covers model uncertainty for decision making from another
perspective:

1. For N compounds create a list of pairs (y_i, ŷ_i) with y_i as the observed toxicity
for a compound i and ŷ_i as the predicted toxicity for a compound i. N preferably
equals the total number of compounds available for a data set, i.e., learning and
testing data. The estimated/predicted values ŷ_i can be any values of the prediction
interval, for example the minimum, maximum, or mean.

2. Sort the matrix [y ŷ] with respect to column ŷ.

3. Create q equidistant intervals (classes) based on ŷ.
The result is q disjoint classes of corresponding observed and estimated toxicity
values. For each class j, j = 1, 2, ..., q, the estimated toxicity mean and the minimum,
maximum, and mean of the observed toxicities can be calculated. This means that an
interval of observed toxicity values is obtained for a given interval of predicted
toxicities, and this interval describes the prediction's uncertainty for the related
class. A new compound's most likely prediction from the prediction interval, for
example, would decide which prediction class the compound falls into, along with the
class's uncertainty given by the interval of past observed toxicity values. Figure 8
plots the results of a derived decision model for q=12 classes from the predictions of
the combined model reported in 3.3, and table 3 lists the underlying data of fig. 8 for
reference. For comparison, the results based on the minimum (most toxic) predictions
of the 13 individual models of section 3.2 are shown in fig. 9. Table 4 shows the
accuracy values for these two decision models compared to a mean-based model.
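
The three steps and the per-class statistics can be sketched in Python as follows. This
is only an illustration under stated assumptions: the class boundaries are taken as q
equal-width intervals between the smallest and largest predicted value, the function
and key names are hypothetical, and the sample data are invented.

```python
def decision_classes(observed, predicted, q):
    """Group (observed, predicted) pairs into q equal-width classes of predicted values.

    Returns one summary per non-empty class, in order of increasing prediction.
    """
    lo, hi = min(predicted), max(predicted)
    if hi == lo:
        raise ValueError("predicted values must not all be identical")
    width = (hi - lo) / q
    pairs = sorted(zip(predicted, observed))  # step 2: sort by predicted value
    members = [[] for _ in range(q)]  # step 3: q equidistant classes
    for pred, obs in pairs:
        j = min(int((pred - lo) / width), q - 1)  # largest value goes to last class
        members[j].append((obs, pred))
    table = []
    for j, group in enumerate(members, start=1):
        if not group:
            continue
        obs_vals = [o for o, _ in group]
        pred_vals = [p for _, p in group]
        table.append({
            "class": j,
            "n": len(group),
            "from_predicted": lo + (j - 1) * width,
            "to_predicted": lo + j * width,
            "min_observed": min(obs_vals),
            "mean_observed": sum(obs_vals) / len(obs_vals),
            "max_observed": max(obs_vals),
            "mean_predicted": sum(pred_vals) / len(pred_vals),
        })
    return table


# Invented example with six compounds and q = 2 classes.
obs = [-5.0, -4.0, -3.5, -1.0, -0.5, 0.2]
pred = [-4.8, -4.4, -3.0, -1.2, -0.2, 0.0]
table = decision_classes(obs, pred, q=2)
for row in table:
    print(row)
```

For a new compound, the class whose predicted-toxicity range contains its prediction
then supplies the interval of observed toxicities as an estimate of the uncertainty.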

[Figure 8. Panel title: Demetra Combined; x-axis: number of class members; legend: Mean_Tox and tox. interval, Mean_Pred of a class]

Fig. 8. Decision model based on the predictions of the combined model



Table 3. Underlying data of the decision model of fig. 8

        Number    From        To          Min.       Mean       Max.       Mean
        of class  predicted   predicted   observed   observed   observed   predicted
Class   members   toxicity    toxicity    toxicity   toxicity   toxicity   toxicity
1       6         -6.90       -6.24       -7.74      -6.23      -5.62      -6.49
2       3         -6.24       -5.59       -6.27      -6.03      -5.79      -5.88
3       8         -5.59       -4.93       -6.84      -5.45      -3.98      -5.26
4       7         -4.93       -4.27       -5.24      -4.50      -4.02      -4.57
5       23        -4.27       -3.61       -5.58      -3.87      -2.13      -3.89
6       35        -3.61       -2.95       -5.02      -3.37      -1.64      -3.31
7       53        -2.95       -2.29       -4.40      -2.66      -0.47      -2.61
8       69        -2.29       -1.63       -3.66      -1.83       0.36      -1.95
9       44        -1.63       -0.97       -3.10      -1.46       0.12      -1.33
10      21        -0.97       -0.31       -3.27      -0.60       0.30      -0.74
11      8         -0.31        0.35       -1.09      -0.15       0.43      -0.10
12      4          0.35        1.01       -0.10       0.32       1.33       0.66


[Figure 9. Panel title: Demetra Min. Tox.; x-axis: number of class members (4, 7, 4, 5, 10, 27, 41, 51, 65, 41, 21, 5); legend: Mean_Tox and tox. interval, Mean_Pred of a class]

Fig. 9. Decision model of 12 classes based on the minimum predictions of 13 individual models



Table 4. Accuracy of three decision models for trout

                                              Min. Tox. vs.   Mean Tox. vs.   Max. Tox. vs.
                                              Mean Pred.      Mean Pred.      Mean Pred.
R² decision model fig. 8                      0.51            0.99            0.68
R² decision model fig. 9                      0.75            0.79            0.0
R² decision model using the mean prediction
of 13 models (not displayed)                  0.4             0.97            0.5



The result in table 4 confirms the expectation that the combined model shows a 
higher performance than just using the mean of a number of individual models. 



4 Conclusions 

The current results and conclusions are primarily based on the Demetra data set, but
several other toxicity data sets have also been investigated.

1. Animal tests run to obtain the data source for toxicity QSAR modelling are
described by a complex, nonlinear dynamic ecotoxicological system. The mortality rate
of a certain species as an observed output variable of this system, however, is not the
object of toxicity modelling. Instead, an input variable of the test system - the
external disturbance LC50 (lethal concentration or dose) - is modelled by a
pollutant's molecular structure. The system's observed output variable, the mortality
rate y, is mapped by a single pair of observations (LC50, y) and, therefore, is
described by a linear static model a priori. This, in fact, is a strong simplification
of the ecotoxicological system.






2. Since the values measured for LC50 can vary by up to a factor of 4 when multiple
tests are run, it is also not exactly clear which of these values can be seen as the
“true” value for modelling. This value, as the models' target variable, however, has an
important impact on both predictive and descriptive model results, which ultimately
means uncertain model results.

3. The input information used for modelling does not reflect the desired input-output
relation of the complex ecotoxicological system very well, and this results in highly
noisy data. Observing additional characteristic state or external variables of the
test system during the animal tests may significantly reduce the data's noise and thus
the models' uncertainty. The modelling approach should be improved to better cover the
system's non-linear and dynamic behaviour.

4. Applying GMDH for multileveled self-organisation and model combining turns out to
be a very effective and valuable knowledge extraction technology for building reliable
and interpretable models objectively, in short time frames, and directly from noisy and
high-dimensional data sets. Also, the obtained models are easy to implement in other
runtime environments for application and reuse.

5. Decision-making has to take into account the models' uncertainty. Prediction and
toxicity intervals obtained by applying many alternative models are one efficient way
to meet this goal inherently.
Abstract. Predictive chemical models, commonly called quantitative structure-activity
relationships (QSAR), are facing a period of changes and challenges. There is a
transition from classical models to new, more sophisticated models. Meanwhile, there is
increased interest among regulators in QSAR in fields such as toxicity assessment.
This requires more standardisation, even though the research is very dynamic and no
common opinion exists on many issues. The present article is a contribution to the
discussion on how to standardize data and compare models, with special attention to
advanced QSAR methods, identifying the problems and targets in the field.



1 Introduction 

Models to predict the activity and properties of chemicals on the basis of their
chemical structure have been studied for many years, and they are usually called
Quantitative Structure-Activity Relationships (QSAR) [1]. More recently a series of
powerful advanced computer tools has been introduced [2].

The case of toxicity prediction is compounded by many factors, which make it a
difficult task. Briefly: 1) toxicity data are noisy, and their availability is limited
because experiments are expensive; 2) knowledge of which chemical information is
important is missing in most cases, so that typically many (up to thousands of)
chemical descriptors are calculated in order to then identify the important ones;
3) the presence of a high number of chemical descriptors complicates the mathematical
treatment of the problem and introduces noise.

Some comments on these issues are necessary. 1) The toxicity experiments used
for computer modelling are the same as those used for the current evaluation of
chemicals and are considered acceptable for common toxicological and ecotoxicological
assessment. A great part of the variability is due to natural factors, as an effect of
the variability between different organisms. 2) Information on some important chemical
factors involved in the toxicity phenomena is available. 3) Powerful computational
tools can offer new possibilities.

However, the newly introduced techniques require a careful evaluation of their
capabilities, mode of use, and possible mistakes. There is a risk of misusing methods
that are not fully understood.
In this article we will discuss how to evaluate the new advanced techniques used
for QSAR. We will discuss some relevant topics, and give examples taken from some
recent European research projects funded by the European Commission. In Section 2
we will describe the toxicological data, its properties, its variability, and how it
can affect the models. We will also discuss its standardization, with some examples. In
Section 3 we will deal with chemical information, its variability and standardization,
providing examples. Section 4 will discuss why and how to compare mathematical models.
In Section 5 we will give some examples of comparison of models. Section 6 presents the
conclusions.



2 Toxicity Data for Models 

The basic information required for QSAR on toxicity is the toxicological and chemi- 
cal information. The assumption is that the toxicity is due to the chemical structure, 
which can fully explain the toxicity effect, through a suitable algorithm. 

2.1 Data Availability and Variability 

Toxicity data are the foundation of toxicity models. If they are unreliable, the model
will be unreliable. Few toxicity data are available, because experiments are expensive
and time consuming. Nowadays ethical issues make the availability of new data more
problematic. Furthermore, some experiments have been conducted according to
standardised protocols, but many of them, mainly in the past, did not follow
verifiable procedures. A further problem is that in many cases toxicity is expressed as
a "greater than" value.

Toxicity values, as other biological values, are variable. Indeed, one factor of the 
observed variability is natural since different individuals respond in different ways to 
chemicals. However, another source of variability is related to the protocol used, 
which can be different for exposure time, route of administration, species, weight and 
sex of the animal, for instance. Also the chemical used for the test, even if nominally 
the same in different experimental studies, can have different purity [3]. 

It is important to assess the reliability of the source. A careful comparison and
choice of the value to be adopted is also very important, preferably supported by
expert judgement to help decide which value to consider the most reliable. Data from
different laboratories are acceptable if the laboratories used the same experimental
protocol. The use of data coming from a single laboratory or researcher can improve
reproducibility, but it should not be made a requirement, because it can introduce a
bias, and theoretically the need to use data from such a limited source is against the
definition of a standardized protocol. Indeed, the protocol should report all
necessary experimental conditions to be adopted. If it does not describe some critical
parts, it is not valid. If results were necessarily related to the operator, we could
model not only the endpoint, but also the operator!

Even when more carefully selected databases are used, some variability is to be
expected. Here the problem is which value to choose. This is not a mathematical issue,
but a toxicological one, or rather one related to regulatory decisions, as discussed
below.
2.2 The Toxicity Data Discussed in the Present Article

Below we will discuss in more detail some studies which used the following data
sets.

EPA Duluth Data Set. This data set has been derived from the database prepared by
the U.S. EPA (Environmental Protection Agency) in Duluth and it consists of 568
industrial organic compounds. The toxicity is expressed by median Lethal Concentration
values (LC50) for 96-hour flow-through exposures of the juvenile stage of the Fathead
Minnow [4]. The database also reports for every compound the respective mode of action
(MOA) and its chemical class. The quality of this database is quite high, since the
experiments have been done according to a well-defined protocol. It is one of the
largest databases of ecotoxicological values, and includes substances belonging to a
wide range of chemical classes. As is common for QSAR studies, doses in the studies
mentioned below are expressed as LC50 in mmol/L, while the original toxicity values are
expressed in mg/L. This data set has been used within the EC funded project
IMAGETOX [5], in which several groups worked on this data set using different
models. Later on, this data set was used within the EC project OpenMolGRID [6].
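
The unit conversion mentioned above is a division by the molecular weight: a
concentration in mg/L divided by the molecular weight in g/mol gives mmol/L. A minimal
Python sketch follows; the function names and the example compound are hypothetical,
and taking the decimal logarithm afterwards is an additional assumption that reflects
common QSAR practice rather than a step stated for this data set.

```python
import math


def mg_per_l_to_mmol_per_l(lc50_mg_per_l, molecular_weight_g_per_mol):
    """Convert an LC50 from mg/L to mmol/L (mg/L divided by g/mol)."""
    return lc50_mg_per_l / molecular_weight_g_per_mol


def log_lc50_mmol_per_l(lc50_mg_per_l, molecular_weight_g_per_mol):
    """Decimal logarithm of the LC50 in mmol/L (assumed modelling target)."""
    return math.log10(mg_per_l_to_mmol_per_l(lc50_mg_per_l, molecular_weight_g_per_mol))


# Hypothetical compound: LC50 = 25 mg/L, molecular weight = 250 g/mol.
print(mg_per_l_to_mmol_per_l(25.0, 250.0))
print(log_lc50_mmol_per_l(25.0, 250.0))
```

Expressing doses in molar units makes compounds of very different molecular weight
comparable on a per-molecule basis.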

EPA-OPP Data Set. This data set was developed from the database of the EPA Office
of Pesticide Programs (EPA-OPP), which kindly provided it. The purpose of developing
this database has been to make more readily accessible an up-to-date summary of
EPA-reviewed data on the ecotoxicological effects of all pesticide active ingredients
presently registered or previously manufactured in the U.S., for the greatest possible
diversity of species.

Toxicity data for this database are drawn from several sources and then reviewed.
We used only data of the higher quality, obtained in studies conducted according to
standardized protocols. Here we will discuss a data set for rainbow trout
(Oncorhynchus mykiss) acute toxicity, LC50 for 96 h exposure, expressed in mmol/L.

As in the case of the EPA Duluth data set, the data we used are of high quality.
The variety of chemical moieties is higher than in the EPA Duluth data set, which
seems to be easier to model because it is larger and simpler from a chemical point of
view. From the EPA-OPP database we developed and used a data set within the EC project
DEMETRA [7]. This data set will be called the DEMETRA data set.



2.3 Standardization of Toxicity Data 

As we discussed above, there are many possible sources of toxicity data, but re- 
searchers should use data obtained from experimental studies done according to stan- 
dardized protocols. This is not always possible. Here we will show how we ap- 
proached the issue of preparing a good data set. 

The EPA Duluth data set is an excellent data set, for the reason given above. Still,
it is not yet usable as it stands. Data obtained at EPA in Duluth are included in the
ECOTOX database [8], which contains data from different sources. For many compounds
this database reports more than one toxicity value, and some experimental conditions
differ. Again, expert judgment is necessary to evaluate the data and the possibility
of merging compounds with experiments done under slightly different conditions in
order to increase the size of the data set. It is common to obtain a chemical list
which, for some chemicals, contains more than one toxicity value. At this point we have
to choose which toxicity value to use. The choice is made on the basis of the intended
use of the model. For instance, Russom et al. chose the median one [4]. Another choice
is the lowest value [9], which is in agreement with the conservative principle adopted
by the European Union [10]. Regulators prefer to use the value that provides more
safety.

This shows that it is not enough to have a good database; further selection is
necessary, and this influences model results and uses. We further explored this topic.
Within the project DEMETRA we studied how to improve the quality of the data,
using the DEMETRA data set [11]. Studying the case of rainbow trout toxicity, we
pruned the data, even though they had been obtained according to a standardized
protocol. In particular, data on formulations with a low percentage of active
ingredient (<85%) or on mixtures of more than one ingredient were discarded.
Furthermore, if the variability of the reported toxicity data for the same compound was
greater than a factor of four, we eliminated the pesticide from the data set. This was
done to eliminate chemicals which showed some problems in experiments, resulting in
variability higher than a factor of four. For the other chemicals we chose the lowest
toxicity value, if more than one value was given.
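
The pruning rules just described can be summarised in a short Python sketch. It is an
illustration under assumptions, not the procedure actually implemented in DEMETRA: the
record layout and names are hypothetical, the variability is measured as the ratio of
the largest to the smallest remaining LC50 value, and "lowest toxicity value" is
interpreted as the lowest LC50 (the most conservative choice).

```python
def select_lc50(records, max_factor=4.0, min_active_percent=85.0):
    """Apply the pruning rules to all LC50 records of one compound.

    Each record is a dict with the hypothetical keys "lc50" (positive number),
    "active_percent" and "mixture". Returns the LC50 value to keep, or None
    when the compound is dropped from the data set.
    """
    usable = [
        r["lc50"]
        for r in records
        if r["active_percent"] >= min_active_percent and not r["mixture"]
    ]
    if not usable:
        return None  # no record of sufficient quality
    if max(usable) / min(usable) > max_factor:
        return None  # reported values disagree by more than a factor of four
    return min(usable)  # assumption: lowest LC50 is the conservative choice


consistent = [
    {"lc50": 1.0, "active_percent": 98.0, "mixture": False},
    {"lc50": 2.0, "active_percent": 95.0, "mixture": False},
    {"lc50": 3.5, "active_percent": 99.0, "mixture": False},
]
inconsistent = [
    {"lc50": 1.0, "active_percent": 98.0, "mixture": False},
    {"lc50": 5.0, "active_percent": 98.0, "mixture": False},
]
print(select_lc50(consistent))
print(select_lc50(inconsistent))
```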

Then, we compared the toxicity values so obtained with two other databases
which have been developed for pesticides. One originated from the German
Bundesoberbehorde und Bundesforschungsanstalt, and the second one had been
prepared within the EC project SEEM, co-ordinated by the ICPS (International Centre
for Pesticide Safety), Busto Garolfo, Italy. Both databases have high quality data.
About 300 pesticides from the EPA-OPP database were considered; for about 50 of them
we found values in the German database, while the overlap between the EPA-OPP and SEEM
databases was about 100 compounds. We eliminated compounds showing variability greater
than a factor of four, as done above when considering only data internal to the
EPA-OPP database. This further quality check, comparing high quality databases, is an
additional improvement, since we used only data with an excellent reliability. 282
pesticides were finally used for the DEMETRA data set.



3 Chemical Data 

Chemical variability is much more under control, in principle. Indeed, chemical
descriptors come from experiments or calculations. Chemical experiments, like
calculations, are much easier and more reproducible than toxicity experiments. However,
some experimental measurements of chemical properties also suffer from variability.
LogP is the logarithm of the partition coefficient between octanol and water, and is
the most used chemical descriptor in QSAR studies of aquatic toxicity, since it
correlates with water solubility, soil/sediment adsorption coefficients and
bioconcentration factors for aquatic organisms [12]. Partition coefficients can be
measured experimentally by several techniques, ranging from the simple “shake flask”
technique to popular chromatographic methods. Several factors are involved in the
variability of experimental measures, related to the chemical or the method itself,
including chemical purity, experimental protocol and typing errors. For instance, in a
study on pesticide



Modelling Aquatic Toxicity with Advanced Computational Techniques 239 



logP, 13 compounds out of 235 had a standard deviation bigger than one logP unit
and, since these compounds show acid-base properties, the experimental measurements
may not have been made under the proper conditions to ensure the neutrality of
the compounds [13]. Thus, the quality of experimental data is not optimal, and several
problems arise. Furthermore, experimental values are not always available, and by
definition they are not available for a compound which has never been synthesized.
However, chemical companies in many cases are interested in knowing the properties of
compounds to be synthesized, because the preparation of the chemical compound can
take a long time and be expensive.

For all these reasons calculated descriptors are more appealing. However, in the case
of calculated molecular descriptors there is also a risk of variability owing to 1) a
lack of standardisation and 2) the individual procedure used to optimise the
structures.

Calculation of chemical descriptors is not straightforward, but typically involves
several steps, which vary depending on the software used. Typically, more than one
software package is used in successive steps. It can happen that the same chemical
descriptors differ between software packages. This applies even to some simple ones,
such as the number of double bonds, because they depend on the way the software
calculates aromaticity. Thus, a benzene ring can contain six aromatic bonds, or three
double bonds and three single bonds, which is wrong. The case of the benzene ring is a
simple one, which usually is correctly calculated, but in some heterocyclic rings
aromaticity is more borderline. Another tricky point is related to stereochemistry,
and issues such as enantiomers, diastereoisomers, etc. Finally, in the case of
three-dimensional (3D) descriptors, many of them are highly sensitive to the
optimisation of the structure [3]. Optimisation is typically done by the computational
chemist according to his or her experience. All these features introduce quite a high
amount of variability.

It is important to note that some variability is dependent on the procedure, and 
once the procedure is defined and fixed (for instance the format of the chemical struc- 
ture and the software used to calculate descriptors) the same value should be obtained 
in a reproducible way, even though using a different procedure other values can be 
obtained. However, in the case of 3D descriptors, their reproducibility can be in- 
creased, but variability will remain, if humans calculate them. 

Within IMAGETOX, for the EPA Duluth data set, we evaluated the reproducibility of
results by calculating descriptors in four laboratories, using the same or different
software [14]. For optimisation of the chemical structure we used the following
methods: PM3 with HyperChem or MOPAC, AM1 with VAMP (in TSAR) or MOPAC, and ab
initio methods (HF/6-31G** and B3LYP/6-31G**) using Gaussian. On the chemicals
so optimised, many descriptors were calculated using CODESSA, TSAR, QSARis,
HyperChem, and proprietary software. Some descriptors were in common among all
laboratories. Total energy and heat of formation gave excellent agreement, close to
100%. For LUMO values the squared correlation coefficient, r², varied between 99.8
and 79.9. Greater agreement was observed between values obtained with AM1 and
PM3, while descriptors using ab initio methods for optimization gave values less
comparable with those obtained with AM1 and PM3. Similar results were obtained
for HOMO. Molecular surface area gave r² between 0.97 and 0.77. Similarly, r² for
molecular volume was between 0.98 and 0.78. For dipole moment r² was between
0.83 and 0.39. Thus, some descriptors can give very different results, and the main
factors are the molecular conformation and the software used to calculate them.
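
A comparison of this kind can be sketched in Python as the squared Pearson correlation
between the values of one descriptor obtained in two laboratories for the same
compounds. The function name and the dipole moment values below are hypothetical and
serve only to show the calculation.

```python
def squared_correlation(a, b):
    """Squared Pearson correlation coefficient r^2 between two value lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov * cov / (var_a * var_b)


# Hypothetical dipole moments of four compounds from two laboratories.
lab_1 = [1.0, 2.0, 3.0, 4.0]
lab_2 = [1.0, 3.0, 2.0, 4.0]
print(squared_correlation(lab_1, lab_2))
```

A value close to 1 indicates that both laboratories rank and scale the compounds in
the same way; it does not show that the absolute values agree.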
Within DEMETRA, in order to achieve high quality chemical information, we
adopted the following protocol [15]. The chemical domain was defined. We discarded
inorganic compounds and mixtures of chemicals with different molecular weights. For
mixtures of compounds with the same molecular weight, different rules applied for
tautomers, enantiomers and diastereoisomers, depending on the chemical descriptors to
be calculated (2D or 3D). Salts were transformed by eliminating the inorganic part.
Three different laboratories checked the chemical structure. It is not uncommon,
indeed, that some chemical structures present in databases are wrong. For
conformation, we used crystallographic data as far as possible. In the case of
tautomers, for the purpose of building the high quality data set, we independently
optimised the two isomers with ab initio methods, and chose the more stable isomer.
Similarly, the more stable isomer was chosen for the equatorial/axial arrangement of
saturated ring systems.

All these time-consuming steps adopted for the DEMETRA data set, for both
toxicity values and chemical structures, have been presented to give an idea of the
process of data quality assessment that can be necessary. The protocol we adopted is
the best we could do, and is far more robust than those commonly used for QSAR
modelling. It is possible that this level of quality assessment does not improve the
final model to a great extent. Still, we preferred to start within DEMETRA with data
of the highest possible quality, and for this we defined the protocol described above,
which can be used as a candidate protocol for QSAR modelling.

We mentioned above a very promising perspective in view of standardized methods.
We showed that it is common to obtain different chemical descriptors because of
the use of different software or the involvement of humans in the calculation of
3D descriptors. Within the EC funded project OpenMolGRID an automatic way to
obtain QSAR models has been developed, which includes the calculation of
chemical descriptors, including 3D ones, starting from the 2D structure. So far we
have only preliminary results, because the project is still ongoing. This approach is
very appealing, since it yields highly standardized, automatically generated models
and chemical descriptors. Thus, it is in principle the method of choice for
reproducible, easy models, suitable for regulatory purposes because the variability
related to the use of different software or human procedures is eliminated.
Preliminary studies showed that the OpenMolGRID software is able to optimize most
of the structures of the EPA Duluth data set; the exceptions were due to the use
within OpenMolGRID of software packages that cannot deal with some chemical
structures, for instance those containing tin. This limitation is not related to
OpenMolGRID itself, but to the use within OpenMolGRID of the software MOPAC, one
of the most popular in the field of computational chemistry.



4 Comparing Models 

4.1 Why Compare Models?

In this section we discuss the comparison of the different models used for
QSAR. Evaluating their advantages and disadvantages on the basis of studies
published in the literature is quite difficult, since studies use different data sets,
chemical descriptors, algorithms, etc.

Furthermore, even when the same data set is used, comparison can be difficult. We
showed above in detail the critical points that can deeply affect the toxicity data
and the chemical information. Assessing them has to be the first analysis in the
comparison of models, since toxicity and chemical values deeply influence the final
results. We now continue with the comparison itself. It is useful to consider the
reasons for comparing models, and below we indicate some of them.

• We can be interested in a better knowledge and in an evaluation of some of the
components of a model, for instance which chemical descriptors are more suitable,
or which software is more powerful. This interest is a technical one, addressed more
to the mechanism of a model than to its use. Generally, in this kind of study the
researcher pays more attention to a new chemical descriptor or algorithm than to the
use of the obtained model. The model itself is an example to test descriptors or
algorithms, which can then be applied to other models. Thus, this interest is more
general than specific, and not dedicated to a single application.

• Vice versa, the interest can be addressed more to the model, to its use and real
world application. In this case, the attention is focused not on the techniques but
on the results. There can be at least two kinds of interest in model results.
In one case the interest is basically in the prediction of the toxicity value of a
chemical compound. This is, for instance, the interest of a decision maker. In the
other case the interest is more in understanding the mechanism of a given
toxic molecule. This is more likely the case of a researcher. The model for this
purpose can be very different from that used for predictive purposes.

• Increasing the complexity, we can be interested in the different models with the
aim of combining them into a hybrid architecture. This strategy is better suited to
predictive models than to mechanistic studies, because if we combine individual
models we can easily lose the transparency needed to understand the
toxicity mechanism. However, there are integrated models in which, for instance,
the correct conformation obtained from docking studies is used to better optimise
other models [16].

Thus, different interests are possible and they direct the comparison towards
different features of the models. For instance, mechanistic studies require
transparency, while decision-makers can be more pragmatic. A theoretical
mathematician is more interested in a standardised assessment of the algorithm,
while models used for regulatory purposes introduce weights to prefer, for instance,
false positives over false negatives. Thus, these different interests tend to
introduce assessment tools suited to the different purposes.

4.2 How to Compare Models? 

Now we can evaluate the tools for the different assessments. We can evaluate 1) 
model performances, 2) outliers obtained in the different models, 3) descriptors used 
for modelling purposes, 4) model uncertainty, 5) definition of the model domain. 

Model Performances. The first way to compare models is to evaluate whether the model
correctly calculates the chemical toxicity. This is the oldest way, and it gives
an estimate of how well the model fits the data. In past QSAR models the correlation
coefficient was used. The correlation coefficient is a measure of the association
between values of x and y; it reflects the dispersion of data points around a straight
line. A more useful measure is its square, r², the Coefficient of Determination (COD).
However, both are too optimistic. In general, good performance in fitting does not
guarantee good results when the same model is used to predict new values. This risk
is higher with recent QSAR models, which have a wider chemical domain (classical
QSAR models were more closely linked to chemicals belonging to a restricted chemical
class), and make use of many more chemical descriptors and of algorithms such as
neural networks. Indeed, “as long as the net structure has enough complexity, a neural
net can be trained to produce any desirable error level on the training set.” [17].
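
As a minimal sketch of the two fitting measures just described, the following Python
fragment computes the correlation coefficient r and its square from observed and
fitted values. The function name `pearson_r` and the numerical values are illustrative
assumptions, not data from the studies discussed here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical observed and fitted toxicity values (illustration only).
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted = [1.2, 1.9, 3.2, 3.8, 5.1]

r = pearson_r(observed, fitted)
r2 = r ** 2  # coefficient of determination of the fit
```

Both values describe only how well the model reproduces the data it was built on,
which is why they say little about predictions for new compounds.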

To obtain more robust models, other ways to measure performance have been
introduced, such as the leave-one-out method, which gives q² or r²cv [18].

A simple extension of this approach leaves out different numbers of objects for
validation (leave-more-out) [19]. Increasing the number of objects to be evaluated
(and reducing the number of inputs, i.e. chemical descriptors) improves the
reliability of the predictive model. The error can be measured, and a more robust
measure of performance is R², the squared multiple correlation coefficient, which is
also called the Coefficient of Determination. Objects can be used only once in the
test sets, or more than once, within different validation tools.
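
The leave-one-out and leave-more-out procedures can be sketched as follows for a
model with a single descriptor. This is a simplified illustration under stated
assumptions: the model is an ordinary least-squares line, the cross-validated
statistic is taken as 1 - PRESS/SS_tot with SS_tot computed around the overall mean
(one common convention), objects are left out in contiguous groups so that each is
predicted exactly once, and all names and values are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single descriptor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2_fit(x, y):
    """Coefficient of determination of the fit on all objects."""
    slope, intercept = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot


def q2_cv(x, y, leave_out=1):
    """Cross-validated q2; leave_out=1 corresponds to leave-one-out."""
    n = len(x)
    mean_y = sum(y) / n
    press = 0.0  # predictive residual sum of squares
    for start in range(0, n, leave_out):
        held = range(start, min(start + leave_out, n))
        x_train = [x[i] for i in range(n) if i not in held]
        y_train = [y[i] for i in range(n) if i not in held]
        slope, intercept = fit_line(x_train, y_train)
        press += sum((y[i] - (slope * x[i] + intercept)) ** 2 for i in held)
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - press / ss_tot


# Hypothetical descriptor (x) and toxicity (y) values, for illustration only.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [1.1, 1.8, 3.2, 3.9, 5.2, 5.8]

r2 = r2_fit(x, y)                   # fitting performance
q2_loo = q2_cv(x, y, leave_out=1)   # leave-one-out
q2_lmo = q2_cv(x, y, leave_out=2)   # leave-more-out, two objects at a time
```

For a least-squares model the leave-one-out value cannot exceed the fitted r², which
illustrates why the fitted value alone is an optimistic measure. In practice the
left-out groups are usually drawn at random rather than contiguously.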

Several articles adopted an external data set not used for training.
A debate is ongoing on the advantages and disadvantages of the different methods for
predictive models, considering the critical issue that the number of compounds is very
limited in the case of toxicity QSAR. To obtain more credible models an external data
set should be used, but this reduces the number of compounds available for
building the model (because part of them are set aside for the validation set), and
the population of the validation set should be as similar as possible to that of the
training set. For instance, Arciniegas et al. stated that “in order to determine a net’s ability
to generalize, it must be evaluated on a test data set which was not used during the
training” [17]. Several authors raised the concern that the leave-one-out
approach can be too optimistic [20]. Some interesting guidelines for validation are
given in [21].
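
A minimal sketch of validation with an external set follows. It assumes a
single-descriptor least-squares model, a deterministic split in which every third
compound is set aside, and an external r² computed around the mean of the external
set (one of several conventions in use). All names and values are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single descriptor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def external_r2(x_train, y_train, x_ext, y_ext):
    """Fit on the training set only, then score on the external set."""
    slope, intercept = fit_line(x_train, y_train)
    mean_ext = sum(y_ext) / len(y_ext)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x_ext, y_ext))
    ss_tot = sum((b - mean_ext) ** 2 for b in y_ext)
    return 1 - ss_res / ss_tot


# Hypothetical descriptor and toxicity values, for illustration only.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
y = [1.1, 1.8, 3.2, 3.9, 5.2, 5.8, 7.1, 8.2, 8.8]

# Every third compound goes to the external set and is never used for fitting.
ext_idx = set(range(2, len(x), 3))
x_train = [v for i, v in enumerate(x) if i not in ext_idx]
y_train = [v for i, v in enumerate(y) if i not in ext_idx]
x_ext = [v for i, v in enumerate(x) if i in ext_idx]
y_ext = [v for i, v in enumerate(y) if i in ext_idx]

r2_ext = external_r2(x_train, y_train, x_ext, y_ext)
```

The split makes the trade-off discussed above visible: three of the nine compounds
are no longer available for building the model.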

Still, all these efforts at performance measurement are not yet sufficient for the
end-user, because these methods do not distinguish between positive and negative
errors, and the consequences of false positives and false negatives are very different.
For regulators it is very important to avoid false negatives, in order to avoid
accepting for use compounds that actually have toxicity problems. Vice versa, false
positives can be a problem for industry, which is interested in avoiding the
elimination of compounds from the market because they are predicted to have negative
effects. Thus, models should be evaluated differently according to whether they give
positive or negative errors.
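
The asymmetry between the two kinds of error can be illustrated with a small sketch
in which predictions are reduced to toxic/non-toxic labels. The weights, the labels
and the function names are hypothetical and chosen only to show how two models with
the same number of errors can rank differently for different users.

```python
def count_errors(observed, predicted):
    """Return (false positives, false negatives); True means 'toxic'."""
    fp = sum(1 for o, p in zip(observed, predicted) if p and not o)
    fn = sum(1 for o, p in zip(observed, predicted) if o and not p)
    return fp, fn


def weighted_cost(observed, predicted, fp_weight=1.0, fn_weight=1.0):
    """Total error cost with separate weights for the two kinds of error."""
    fp, fn = count_errors(observed, predicted)
    return fp_weight * fp + fn_weight * fn


# Hypothetical labels for six compounds (illustration only).
observed = [True, True, True, False, False, False]
model_a = [True, True, False, False, False, False]  # one false negative
model_b = [True, True, True, True, False, False]    # one false positive

# A regulator penalises false negatives more heavily (assumed weight 5).
regulator_a = weighted_cost(observed, model_a, fn_weight=5.0)
regulator_b = weighted_cost(observed, model_b, fn_weight=5.0)

# An industrial user penalises false positives more heavily (assumed weight 5).
industry_a = weighted_cost(observed, model_a, fp_weight=5.0)
industry_b = weighted_cost(observed, model_b, fp_weight=5.0)
```

With equal weights the two models are indistinguishable; with the assumed weights the
regulator prefers model B and the industrial user prefers model A.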

Model Outliers. A different tool to evaluate models is observing outliers. Comparing
outliers obtained with different models can be very instructive. If the outliers are
the same, the models are equivalent in this regard. Outliers can be very useful
to understand which mechanism the model explains. It can happen that the outliers
belong to the same chemical class. Thus, we can split the data set into a larger part
modelled by the basic model, and a smaller part, which can then be addressed by a
second model. However, in most cases this splitting is not easy.

It can be that model performances are the same, but outliers are different. In this 
case it can be very useful to combine the different models, to reduce the total number 
of outliers. 
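
The comparison of outliers can be sketched with simple set operations. Here an
outlier is assumed to be any compound whose absolute residual exceeds a fixed
threshold; the threshold, the compound identifiers and the values are hypothetical.

```python
def outliers(observed, predicted, threshold):
    """Compounds whose absolute residual exceeds the threshold."""
    return {name for name in observed
            if abs(observed[name] - predicted[name]) > threshold}


# Hypothetical toxicity values and predictions of two models (illustration only).
observed = {"c1": 2.0, "c2": 3.0, "c3": 4.0, "c4": 5.0, "c5": 1.0}
pred_a = {"c1": 2.1, "c2": 4.5, "c3": 4.2, "c4": 6.4, "c5": 1.1}
pred_b = {"c1": 2.2, "c2": 3.1, "c3": 5.6, "c4": 6.5, "c5": 0.9}

out_a = outliers(observed, pred_a, threshold=1.0)
out_b = outliers(observed, pred_b, threshold=1.0)

shared = out_a & out_b   # badly predicted by both models
only_a = out_a - out_b   # candidates to be handled by model B
only_b = out_b - out_a   # candidates to be handled by model A
```

In this example both models have two outliers, but only one compound is an outlier
for both, which is the situation in which combining the models is attractive.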

Descriptors. We can also compare descriptors used in the model. We can compare 
the number of descriptors, and their nature. Generally speaking, it is preferable to use 
a low number of descriptors, to increase model generalization. 

As for the nature of the descriptors, the interest can differ. In one case we may
prefer models using descriptors with better characteristics, as decided by the
user. One possible criterion is the use of descriptors that are faster to calculate
(avoiding 3D descriptors, which are much more time-consuming). Another
criterion can be the choice of descriptors that are more transparent and explanatory
of the mechanism. In this case some 3D descriptors can be preferable to certain
simpler topological descriptors. Another criterion can be reproducibility: we showed
above that some descriptors have lower reproducibility, but unfortunately this
criterion is not typically taken into account.

However, researchers can also work from a different perspective, which is not
exclusive, aimed at preferring some descriptors, but integrative, aimed at taking
advantage of different descriptors, which can contain different chemical information.
This perspective is more interesting in the case of hybrid systems.

Model Domain. The model domain is another important basis for comparing models. “No
model can be expected to extrapolate successfully, yet it is not always obvious what
predictions are extrapolations and what are interpolations” [21].

In the case of focused models (e.g. models for phenols) the domain for inclusion is
clearer, but very often it is not specified when to exclude a chemical. However, the
domain should come with the description of the QSAR model, and some commercial
software packages do indeed report whether the chemical evaluated by the model is
outside the optimal predictive space. Of course it is easier to develop models for a
limited chemical domain than for more complex lists. Taking this into account, an
approach to cope with complex lists of chemicals is to split the chemical domain into
simpler sub-domains. Then, sub-models are developed, which are integrated into an
overall system [22, 23].
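
One simple way to report whether a chemical falls outside the predictive space is a
range check on each descriptor. The sketch below assumes that the domain is the
interval spanned by the training set for every descriptor; this definition, the
descriptor names and the values are illustrative assumptions, and other definitions
of the domain are possible.

```python
def descriptor_ranges(training_set):
    """Minimum and maximum of each descriptor over the training compounds."""
    names = training_set[0].keys()
    return {
        name: (min(c[name] for c in training_set),
               max(c[name] for c in training_set))
        for name in names
    }


def in_domain(compound, ranges):
    """True if every descriptor lies within the range seen in training."""
    return all(low <= compound[name] <= high
               for name, (low, high) in ranges.items())


# Hypothetical training compounds described by two descriptors.
training_set = [
    {"logP": 1.2, "E_LUMO": 0.5},
    {"logP": 3.4, "E_LUMO": -0.2},
    {"logP": 2.1, "E_LUMO": 1.1},
]
ranges = descriptor_ranges(training_set)

inside = in_domain({"logP": 2.0, "E_LUMO": 0.3}, ranges)   # interpolation
outside = in_domain({"logP": 5.0, "E_LUMO": 0.3}, ranges)  # extrapolation
```

A prediction for a compound flagged as outside the ranges would be an extrapolation
and should be reported as such to the user.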

Model Uncertainty. Typically, models do not evaluate their uncertainty. However, it
is more and more important to assess the uncertainty of a model, which is useful
information for the end-user. Also from a theoretical point of view, it is preferable
to have a low uncertainty, so models can also be evaluated on this basis.

The increased interest in probabilistic models for risk assessment will require
uncertainty estimates for any QSAR models eventually adopted.

A major component of uncertainty is the toxicity data, as we have discussed above.
The QSAR model will have at least the uncertainty of the toxicity data it uses. Also
for this reason we were interested in the uncertainty of the toxicity data in the
DEMETRA data set. This will give us a defined basis to assess uncertainty.

A second component of the model uncertainty is the chemical information [3]. We
have shown above the variability of the 3D chemical descriptors between different
laboratories. The variability of 2D descriptors is negligible, provided that the same
protocol is used (structure format, software, etc.). At this point, the model has all
the elements needed to evaluate uncertainty, including what is due to the algorithm
itself.
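
The statement that the model has at least the uncertainty of its toxicity data can be
made concrete under one simplifying assumption: that the three components named
above (toxicity data, chemical information, algorithm) are independent and combine in
quadrature. This additive form and the symbols are an illustrative assumption, not
the scheme adopted within DEMETRA.

```latex
\sigma_{\mathrm{model}}^{2}
  = \sigma_{\mathrm{tox}}^{2}
  + \sigma_{\mathrm{chem}}^{2}
  + \sigma_{\mathrm{alg}}^{2}
  \;\geq\; \sigma_{\mathrm{tox}}^{2}
```

Under this assumption the total can never fall below the contribution of the
toxicity data, however good the descriptors and the algorithm are.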



5 Examples on Model Comparison 

We discuss here some examples relating to the EPA Duluth data set. This is a
quite large and reliable data set from a toxicological point of view. Thus, it is
useful to discuss the results obtained with different methods within the European
projects IMAGETOX and, preliminarily, OpenMolGRID. Some studies have been done
on the complete data set, evaluated as a monolithic, single system. Other studies
evaluated the results obtained after splitting the data set.

5.1 Comparing Performances 

Using the BMLR (Best Multi-Linear Regression) procedure implemented in Codessa
[24], models on the whole data set gave the following results, after excluding five
outliers: squared correlation coefficient r² % = 69.5; r²cv % = 69.3 [25]. The model
used only two chemical descriptors: log P and E_LUMO. After splitting the whole data
set into a training (369 compounds) and an external (189 compounds) set, the results
were r²cv % = 64.1 for the training set and r² % = 69.3. This shows that multivariate
techniques are quite simple, robust and reproducible.

Preliminary models built with software developed within OpenMolGRID gave similar
results. This is quite interesting, since it shows the possibility of obtaining QSAR
models in an automatic way, starting from the simple 2D structure, without human
involvement in the chemical structure optimisation, descriptor calculation and QSAR
modelling.

Another study was done using a Statistical Learning Network approach employing
the concept of inductive multi-leveled self-organization for autonomous creation
of optimal complex models [26]. This modeling technology follows the idea of
Active Neurons that inductively self-optimize the neuron’s objective function at a low
level of model complexity. The model results, expressed with the coefficient of
determination, gave R² = 0.74. Furthermore, a model was built using 400 compounds
and tested on the remaining 168 compounds. The result for this split was
R² = 0.75. Thus, this approach gave better results than those obtained with
linear methods.

Other models have been developed after splitting the data set into sub-sets for the
different modes of action (MOA). Chemicals in the Duluth data set have been split
according to different toxicological behaviour by Duluth researchers [4]. Using
BMLR we modeled the EPA Duluth data set according to MOA. Compounds showing
narcosis MOA were predicted with r²cv from 0.81 to 0.91. Vice versa, compounds
with reactive MOA had an r²cv = 0.58. This is in agreement with general aquatic
toxicity models, which showed that it is much easier to model narcotic compounds,
which are well modeled by logP [12].

The development of models based on MOA has the advantage of being linked to
experimental studies related to a toxicity mechanism studied by the EPA researchers
and other groups in Europe. However, it has some disadvantages. Indeed, by
definition the MOA has been defined experimentally by toxicologists. Thus, models
based on MOA are mechanistic, aimed at explaining a biological process, but cannot
be used for unknown compounds. To overcome this limitation, chemical rules have been
introduced to predict which MOA based model should be used for
predictive purposes [4]. Another critical point is that quite a large number of
compounds do not have a defined MOA (111 out of 568, in the EPA Duluth data set).

A different approach, studied within the IMAGETOX project, is to split chemicals
according to chemical classes [25]. This splitting has the advantage that it can also
be defined for unknown compounds. We split the EPA Duluth data set into 13 chemical
classes. Results were generally better than those on the complete data set. r²cv was
above 80%, and in many cases above 90%, with the exception of alcohols and aldehydes,
the latter giving unacceptable results (below 50%).

Aldehydes are indeed quite a difficult case. For this reason we dedicated special
attention to this particular chemical class, using 1) a more careful molecular
optimization and chemical structure study, including ab initio methods, and BMLR;
and 2) a different way to describe the chemical information: correlation weights of
neighbouring codes. In the latter case, using an external test set of 50% of the
aldehydes, we obtained an r² = 0.64 [27]. In the case of BMLR the best predictive
model was obtained with the HF/STO-3G model (r²cv = 0.84), while among the
semi-empirical methods a good predictivity was observed with the PM3 based model
(r²cv = 0.77) [28]. Similar results were obtained by splitting the data set into three
parts, and using one third of the compounds for validation. This shows that the way
the chemical information is extracted can be very important, mainly for focused
situations.

In a different approach data mining techniques were applied to the same chemical
classes studied above [29]. We used different selection algorithms for descriptor
reduction, followed by linear regression. Performances (as r²) were similar to or
better than those with BMLR (in the case of aldehydes, 0.61), but with
a higher number of chemical descriptors. Then we developed a classifier for chemical
classes, training different classification algorithms. Inputs were chemical descriptors.
The best results were obtained with a meta-classifier scheme applied to the
J48 algorithm. In this way the software, not the human expert, assigns chemical
classes. Correct classification was more than 85%. The output of this classifier was
then used to select the appropriate toxicity model for each compound in the data set.
After combining the different sub-models the results improved considerably: r² of the
combined model was above 0.8.
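
The way a classifier output selects the toxicity sub-model can be sketched as a
simple routing step. The rule-based `classify` function below is only a stand-in for
a trained classifier (it is not the J48-based scheme of the study), and the class
names, descriptor names and coefficients are hypothetical.

```python
def classify(compound):
    """Hypothetical stand-in for a trained classifier of chemical classes."""
    return "aldehyde" if compound["has_CHO"] else "other"


# Hypothetical sub-models: each maps descriptors to a predicted toxicity value.
sub_models = {
    "aldehyde": lambda c: 0.9 * c["logP"] + 1.5,
    "other": lambda c: 0.8 * c["logP"] + 0.7,
}


def predict(compound):
    """Assign the chemical class, then apply the sub-model for that class."""
    chem_class = classify(compound)
    return chem_class, sub_models[chem_class](compound)


# Two hypothetical compounds with the same logP but different classes.
class_1, tox_1 = predict({"logP": 2.0, "has_CHO": True})
class_2, tox_2 = predict({"logP": 2.0, "has_CHO": False})
```

Because the class is assigned by software, the same routing can be applied to
compounds that no human expert has examined; a misclassified compound, however, is
sent to the wrong sub-model.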

A further step towards an automatic model was made by avoiding the use of chemical
classes, as defined by human experts and learned by the classifier. Instead, the model
was a hybrid architecture using unsupervised neural networks for clustering, and then
supervised neural networks for toxicity models [29]. Chemical descriptors were used
in both cases. The optimal number of clusters for the EPA Duluth data set was nine.
Performances of the model were assessed with an external set of 20% of the
chemicals. r² on the external set was 0.76.

These studies show methods that take advantage of advanced knowledge
engineering methods. We also evaluated how much improved chemical information can
affect results on the total EPA Duluth data set. It is interesting to compare results
of models developed using descriptors starting from structures based on ab initio
calculations with those based on the semi-empirical method AM1. We compared results
by developing models with the GMDH approach introduced above. The models based on
ab initio calculations were not better than those based on AM1 [26]. This can appear
unexpected, given the more precise information that in principle arises from ab initio
methods. The probable reason for the equivalence of the results is that toxicity data
contain a high variability; furthermore, the wide diversity of chemical structures
present in the data set increases the complexity of the process under study. Indeed,
results on the more homogeneous subset of aldehydes, discussed above, showed that
ab initio models were better, but not always, depending on the ab initio algorithm
used for the calculation.

5.2 Other Parameters 

The outliers of these models are often very similar. They include small compounds
(such as acrolein), which are very often difficult to model. The use of more local
models can take these peculiar situations into account, as discussed above for
aldehydes.

Different descriptors were used in the different models. LogP is quite common in
many models, but the high redundancy of similar descriptors can explain why the
other descriptors differ from model to model.

The current models have not fully studied the issue of the model domain. A major 
reason for this is the limited availability of experimental data. 

Uncertainty has been addressed only within the DEMETRA data set, in a preliminary
way, since the project is still ongoing. We discussed above the theoretical basis
of the way we addressed this.



6 Conclusions 

We discussed ways to standardize data and compare predictive models. A first step
is a solid knowledge of data reliability. We presented a candidate protocol
to produce high quality checked data sets for toxicity values. Besides the availability
of good quality data, a robust way to select appropriate toxicity values is often
necessary, and we commented on the way to choose them. Knowledge of the conduct and
meaning of the toxicity experiments is often necessary. Furthermore, the purpose of
the model should be defined, in order to make the final decision on the value.

Similarly, we introduced protocols for high quality chemical data, using a human
procedure (according to the DEMETRA protocol) or an automatic one (according to
the OpenMolGRID approach). In any case the chemical structure has to be checked.
Then, for the protocol carried out by human experts, we discussed the definition of
the chemical domain. There are some tricky points in structure design, such as the
tautomerism and stereochemistry of the compounds. Besides this source of error,
chemical descriptors can vary, depending on their nature.

We introduced five specific parameters for model comparison and discussed
examples: performance scores, outliers, chemical descriptors, model domain and
uncertainty. These parameters offer flexible tools to evaluate models, also depending
on the use of the QSAR model, which can be for predictive purposes or mechanistic
studies.